Targeted Risk Understanding \& Scoring Technology (TRUST)

A Data Science Project on Home Credit Risk Analysis


Image obtained from LinkedIn

In [ ]:
# @title TRUST
from IPython.display import IFrame

# Embed YouTube video
IFrame(src="https://www.youtube.com/embed/fBDyLng_aJw?si=WG24rSzmf7bYAmnS",
       width=1680,
       height=945)
Out[ ]:

Team TRUIST¶

This project is a collaborative effort by the team TRUIST for the University of Maryland as part of the Principles of Data Science course. All code and prose have been developed by the team members: Arunbh Yashaswi, Swattik Maiti, Eniyan Ezhilan, Ajaykumar Balakannan, and Ritik Pratap Singh. The knowledge and skills used to complete this study were acquired through the course, academic materials, and credible internet sources.

The study conducted in this project concerns credit risk analysis in the financial domain. The purpose is to provide an impartial and data-driven exploration into the factors affecting loan defaults. This project does not promote any specific viewpoint or financial institution but aims to contribute to a better understanding of credit risk assessment.


Final Report:¶

Trust Home Credit Loan Default Analysis Final Report


Project Source Code:¶

GitHub Repository: TRUST

Table of Contents¶

  • Introduction

  • Part 1: Data Collection

  • Part 2: Data Cleaning

  • Part 3: Exploratory Data Analysis

  • Part 4: Hypothesis Testing

  • Part 5: Feature Creation, Feature Engineering and Feature Selection

  • Part 6: Top Feature Selection

  • Part 7: Machine Learning Methodology and Architecture - Model Stacking

  • Part 8: Final Model Curation (Meta Model)

  • Part 9: Conclusion

Introduction¶

In today’s rapidly evolving financial landscape, accurately predicting credit risk is a significant challenge faced by financial institutions worldwide. Credit risk refers to the possibility of a borrower failing to meet their repayment obligations, leading to financial losses for lenders. To mitigate this risk, credit scoring systems have traditionally relied on historical credit data, such as repayment history and credit utilization. While these methods have proven effective for many borrowers, they fall short when evaluating first-time borrowers or individuals with limited credit histories. This limitation creates significant gaps in financial inclusion, leaving many deserving individuals without access to credit. The purpose of this project is to analyze loan defaults using a comprehensive dataset provided by Home Credit Group. This dataset captures various financial, demographic, and behavioral attributes of borrowers, offering an opportunity to go beyond traditional credit scoring metrics. By leveraging advanced data science techniques, we aim to uncover meaningful insights and develop predictive models that address the limitations of existing credit risk frameworks.

Note: Comprehensive Documentation and Final Dataset¶

Throughout this project, we have tackled multiple stages of data preparation, feature engineering, exploratory analysis, hypothesis testing, and machine learning model development. Each stage required meticulous work and detailed explanations. To keep everything organized and easy to follow, we have provided links to the relevant notebooks in each section of this report. These notebooks are hosted on our GitHub repository and serve as standalone guides for each process we implemented.

👉 Explore the detailed notebooks on our GitHub Repository.

Why Everything is Not in One Notebook¶

Given the complexity and breadth of this project, it was not practical to consolidate all the work into a single notebook. Each stage of the process—data cleaning, exploratory data analysis, feature engineering, feature selection, and model development—required detailed code, extensive computation, and specific explanations tailored to that step.


Final Training Dataset: ultimate_op_dataset¶

After completing all the steps of data cleaning, feature engineering, and feature selection, we created the ultimate_op_dataset, which represents the final, consolidated dataset used for training our machine learning models.

👉 Download the ultimate_op_dataset from GitHub


Initial Hypothesis¶

We aim to identify key factors that influence loan defaults and assess how effectively machine learning models can predict borrower risk. This study will cover the entire data science lifecycle, from data collection and preprocessing to feature engineering, model development, and final evaluation. Our goal is to create a robust framework that financial institutions can utilize to make informed credit decisions while addressing gaps in existing scoring models. Specifically, we will investigate the following:

  • Key Predictors: Analyze the relative importance of features such as amount financed, delinquency, loan tenure, past loan performance, employment stability, and income in predicting loan defaults.

    • Example: Applicants who have a high loan amount relative to their income or a history of delinquency are more likely to default. Similarly, borrowers with shorter loan tenures and consistent past loan repayment behavior may exhibit lower default risk, while those with unstable employment histories or irregular income patterns are at higher risk of non-repayment.
  • Model Performance: Compare the effectiveness of machine learning models like logistic regression, random forests, and gradient boosting in predicting loan default probabilities.

    • Example: Models trained on non-traditional features (e.g., behavioral data) may outperform those relying solely on traditional financial metrics.

This project seeks to improve the accuracy of credit scoring, reduce lending risks, and foster financial inclusion by leveraging modern data science techniques.
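As an illustrative sketch of the first hypothesis (not project code), the snippet below builds a credit-to-income ratio on a small synthetic sample and compares observed default rates on either side of the median ratio. The column names mirror the Home Credit schema, but the data, the derived `CREDIT_INCOME_RATIO` feature, and the median split are assumptions made up for illustration.

```python
import pandas as pd

# Synthetic applicants; column names follow the Home Credit schema
df = pd.DataFrame({
    "AMT_INCOME_TOTAL": [180000.0, 90000.0, 250000.0, 60000.0, 150000.0, 75000.0],
    "AMT_CREDIT":       [900000.0, 500000.0, 300000.0, 400000.0, 200000.0, 600000.0],
    "TARGET":           [1, 1, 0, 1, 0, 0],
})

# A high loan amount relative to income is hypothesized to signal higher risk
df["CREDIT_INCOME_RATIO"] = df["AMT_CREDIT"] / df["AMT_INCOME_TOTAL"]

# Split applicants at the median ratio and compare observed default rates
median_ratio = df["CREDIT_INCOME_RATIO"].median()
high = df[df["CREDIT_INCOME_RATIO"] > median_ratio]
low = df[df["CREDIT_INCOME_RATIO"] <= median_ratio]
print(f"Default rate (high ratio): {high['TARGET'].mean():.2f}")
print(f"Default rate (low ratio):  {low['TARGET'].mean():.2f}")
```

On this toy sample the high-ratio group defaults more often, which is the direction the hypothesis predicts; the real test of course happens on the full dataset later in the project.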

Why Is This Important?¶

Credit risk assessment plays a crucial role in maintaining financial stability while promoting responsible lending practices. The landscape of credit evaluation is shifting, with traditional methods often failing to account for the complexities of modern borrower profiles. This study aims to offer a data-driven approach to understanding the factors that lead to loan defaults, helping financial institutions refine their risk models and extend credit responsibly. By improving predictive accuracy, this analysis can contribute to broader financial inclusion, reduce default rates, and foster greater trust between lenders and borrowers.

Importing Relevant Libraries¶

We begin by importing the necessary Python libraries for data manipulation, analysis, visualization, and machine learning. These libraries will enable efficient data exploration and help us build robust predictive models.

In [ ]:
!pip install dask[dataframe]
!pip install tensorflow-addons
Requirement already satisfied: dask[dataframe] in /usr/local/lib/python3.10/dist-packages (2024.10.0)
Requirement already satisfied: click>=8.1 in /usr/local/lib/python3.10/dist-packages (from dask[dataframe]) (8.1.7)
Requirement already satisfied: cloudpickle>=3.0.0 in /usr/local/lib/python3.10/dist-packages (from dask[dataframe]) (3.1.0)
Requirement already satisfied: fsspec>=2021.09.0 in /usr/local/lib/python3.10/dist-packages (from dask[dataframe]) (2024.10.0)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from dask[dataframe]) (24.2)
Requirement already satisfied: partd>=1.4.0 in /usr/local/lib/python3.10/dist-packages (from dask[dataframe]) (1.4.2)
Requirement already satisfied: pyyaml>=5.3.1 in /usr/local/lib/python3.10/dist-packages (from dask[dataframe]) (6.0.2)
Requirement already satisfied: toolz>=0.10.0 in /usr/local/lib/python3.10/dist-packages (from dask[dataframe]) (0.12.1)
Requirement already satisfied: importlib-metadata>=4.13.0 in /usr/local/lib/python3.10/dist-packages (from dask[dataframe]) (8.5.0)
Requirement already satisfied: pandas>=2.0 in /usr/local/lib/python3.10/dist-packages (from dask[dataframe]) (2.2.2)
Requirement already satisfied: dask-expr<1.2,>=1.1 in /usr/local/lib/python3.10/dist-packages (from dask[dataframe]) (1.1.16)
Requirement already satisfied: pyarrow>=14.0.1 in /usr/local/lib/python3.10/dist-packages (from dask-expr<1.2,>=1.1->dask[dataframe]) (17.0.0)
Requirement already satisfied: zipp>=3.20 in /usr/local/lib/python3.10/dist-packages (from importlib-metadata>=4.13.0->dask[dataframe]) (3.21.0)
Requirement already satisfied: numpy>=1.22.4 in /usr/local/lib/python3.10/dist-packages (from pandas>=2.0->dask[dataframe]) (1.26.4)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas>=2.0->dask[dataframe]) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=2.0->dask[dataframe]) (2024.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.10/dist-packages (from pandas>=2.0->dask[dataframe]) (2024.2)
Requirement already satisfied: locket in /usr/local/lib/python3.10/dist-packages (from partd>=1.4.0->dask[dataframe]) (1.0.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas>=2.0->dask[dataframe]) (1.17.0)
Collecting tensorflow-addons
  Downloading tensorflow_addons-0.23.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.8 kB)
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from tensorflow-addons) (24.2)
Collecting typeguard<3.0.0,>=2.7 (from tensorflow-addons)
  Downloading typeguard-2.13.3-py3-none-any.whl.metadata (3.6 kB)
Downloading tensorflow_addons-0.23.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (611 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 611.8/611.8 kB 9.0 MB/s eta 0:00:00
Downloading typeguard-2.13.3-py3-none-any.whl (17 kB)
Installing collected packages: typeguard, tensorflow-addons
  Attempting uninstall: typeguard
    Found existing installation: typeguard 4.4.1
    Uninstalling typeguard-4.4.1:
      Successfully uninstalled typeguard-4.4.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
inflect 7.4.0 requires typeguard>=4.0.1, but you have typeguard 2.13.3 which is incompatible.
Successfully installed tensorflow-addons-0.23.0 typeguard-2.13.3
In [ ]:
# Standard Python libraries for data manipulation and analysis
import numpy as np
import pandas as pd

# Libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

# Libraries for data preprocessing
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score

# Libraries for machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report, confusion_matrix

# Advanced machine learning libraries
import lightgbm as lgbm
import xgboost as xgb
from xgboost import XGBClassifier

# Libraries for statistical analysis
from scipy.stats import *

# Additional utilities
from collections import Counter
import os
import warnings
warnings.filterwarnings("ignore")
from google.colab import drive
# Libraries for big data processing (if needed)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler, StandardScaler as SparkStandardScaler  # aliased to avoid shadowing sklearn's StandardScaler
from pyspark.ml.classification import LogisticRegression as SparkLogisticRegression, RandomForestClassifier as SparkRandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

%matplotlib inline
In [ ]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Data Loading¶

Obtaining Data from Kaggle¶

The dataset for this project was sourced from the Home Credit Default Risk competition on Kaggle. This dataset provides detailed information about loan applicants, their financial and demographic profiles, and their repayment histories. It serves as the foundation for building predictive models to assess credit risk.

After downloading the dataset from Kaggle, the files were uploaded to Google Drive for seamless integration with the Google Colab environment. This setup allows for efficient data access and processing directly in the notebook.

We then loaded each file into separate pandas DataFrames for analysis. This approach ensures modularity and allows for targeted preprocessing of individual tables. Below is the code snippet illustrating how the data was imported:

In [ ]:
def data_loader(filepath: str) -> dict:
    """
    Load all the Home Credit CSV files from `filepath` into a dictionary of
    pandas DataFrames keyed by file name (without the extension).
    """
    try:
        # Dictionary to store loaded DataFrames
        dataframes = {}

        # List of file names to load
        files_to_load = [
            "application_train.csv",
            "bureau.csv",
            "bureau_balance.csv",
            "credit_card_balance.csv",
            "installments_payments.csv",
            "POS_CASH_balance.csv",
            "previous_application.csv",
        ]

        # Load each file and add to the dictionary
        for file in files_to_load:
            df_name = file.split(".")[0]
            dataframes[df_name] = pd.read_csv(os.path.join(filepath, file))
            print(f"Pandas DataFrame '{df_name}' loaded successfully.")

        return dataframes

    except FileNotFoundError as fnf_error:
        print(f"File not found: {fnf_error}")
    except pd.errors.EmptyDataError as ede:
        print(f"Empty data error: {ede}")
    except Exception as e:
        print(f"An error occurred while loading data: {e}")

    return None

# Mount Google Drive
drive.mount('/content/drive')
data_path = '/content/drive/MyDrive/home-credit-default-risk/'

# Use the data_loader function to load all datasets
dataframes = data_loader(data_path)

# Access individual DataFrames
if dataframes:
    appl_train = dataframes["application_train"]
    bureau_df = dataframes["bureau"]
    bureau_balance_df = dataframes["bureau_balance"]
    credit_card_df = dataframes["credit_card_balance"]
    installments_df = dataframes["installments_payments"]
    pos_cash_df = dataframes["POS_CASH_balance"]
    previous_application_df = dataframes["previous_application"]

    # Display the first few rows of one of the tables
    print(appl_train.head())
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Pandas DataFrame 'application_train' loaded successfully.
Pandas DataFrame 'bureau' loaded successfully.
Pandas DataFrame 'bureau_balance' loaded successfully.
Pandas DataFrame 'credit_card_balance' loaded successfully.
Pandas DataFrame 'installments_payments' loaded successfully.
Pandas DataFrame 'POS_CASH_balance' loaded successfully.
Pandas DataFrame 'previous_application' loaded successfully.
   SK_ID_CURR  TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR  \
0      100002       1         Cash loans           M            N   
1      100003       0         Cash loans           F            N   
2      100004       0    Revolving loans           M            Y   
3      100006       0         Cash loans           F            N   
4      100007       0         Cash loans           M            N   

  FLAG_OWN_REALTY  CNT_CHILDREN  AMT_INCOME_TOTAL  AMT_CREDIT  AMT_ANNUITY  \
0               Y             0          202500.0    406597.5      24700.5   
1               N             0          270000.0   1293502.5      35698.5   
2               Y             0           67500.0    135000.0       6750.0   
3               Y             0          135000.0    312682.5      29686.5   
4               Y             0          121500.0    513000.0      21865.5   

   ...  FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21  \
0  ...                 0                0                0                0   
1  ...                 0                0                0                0   
2  ...                 0                0                0                0   
3  ...                 0                0                0                0   
4  ...                 0                0                0                0   

  AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY  \
0                        0.0                       0.0   
1                        0.0                       0.0   
2                        0.0                       0.0   
3                        NaN                       NaN   
4                        0.0                       0.0   

   AMT_REQ_CREDIT_BUREAU_WEEK  AMT_REQ_CREDIT_BUREAU_MON  \
0                         0.0                        0.0   
1                         0.0                        0.0   
2                         0.0                        0.0   
3                         NaN                        NaN   
4                         0.0                        0.0   

   AMT_REQ_CREDIT_BUREAU_QRT  AMT_REQ_CREDIT_BUREAU_YEAR  
0                        0.0                         1.0  
1                        0.0                         0.0  
2                        0.0                         0.0  
3                        NaN                         NaN  
4                        0.0                         0.0  

[5 rows x 122 columns]

Basic Data Exploration¶

In [ ]:
## Some utility functions

#Table shape
def pandas_df_shape(df: pd.DataFrame) -> None:
  """
  Accepts a pandas DataFrame and prints the number of rows and columns of the dataset.
  """
  num_rows = df.shape[0]  # Get number of rows
  num_cols = df.shape[1]  # Get number of columns

  # Print the shape (rows, columns)
  print(f"Shape of DataFrame : ({num_rows}, {num_cols})")

# func to check skewness/imbalance in a column
def check_skewness(df: pd.DataFrame, col_name: str) -> None:
    print(df[col_name].value_counts())
    percentage_counts = df[col_name].value_counts(normalize=True) * 100
    print(percentage_counts)

# func to check the fill rate of each column of a dataset
def calculate_fill_rate(df: pd.DataFrame) -> pd.DataFrame:
    fill_rates = {
        'Column Name': [],
        'Fill Rate (%)': []
    }
    total_rows = df.shape[0]

    for column in df.columns:
        non_null_count = df[column].notnull().sum()
        fill_rate = (non_null_count / total_rows) * 100
        fill_rates['Column Name'].append(column)
        fill_rates['Fill Rate (%)'].append(fill_rate)

    return pd.DataFrame(fill_rates)


# func to make a plot to check distribution of a variable
def plot_distribution(df: pd.DataFrame, target_variable : str) -> None:
    plt.figure(figsize=(15, 6))

    # Histogram
    plt.subplot(1, 3, 1)
    sns.histplot(df[target_variable], bins=30, kde=False, color='blue', edgecolor='black')
    plt.title(f'Histogram of {target_variable}')
    plt.xlabel(target_variable)
    plt.ylabel('Frequency')

    # KDE Plot
    plt.subplot(1, 3, 2)
    sns.kdeplot(df[target_variable], color='red', fill=True)
    plt.title(f'KDE Plot of {target_variable}')
    plt.xlabel(target_variable)
    plt.ylabel('Density')

    # Box Plot
    plt.subplot(1, 3, 3)
    sns.boxplot(x=df[target_variable], color='lightgray')
    plt.title(f'Box Plot of {target_variable}')
    plt.xlabel(target_variable)

    plt.tight_layout()
    plt.show()

The above utility functions simplify data exploration by analyzing DataFrame shape, skewness, fill rates, and variable distributions. These tools provide quick insights into data quality and structure, ensuring efficient preparation for analysis.

In [ ]:
# List of tables to analyze
dataframes_list = [
    appl_train,
    bureau_df,
    bureau_balance_df,
    credit_card_df,
    installments_df,
    pos_cash_df,
    previous_application_df
]

# Example: Use utility functions on all DataFrames
for idx, df in enumerate(dataframes_list, start=1):
    print(f"\n--- Table {idx}: Analysis ---\n")

    # Check shape of the DataFrame
    pandas_df_shape(df)

    # Calculate fill rates for the DataFrame
    print("\nFill Rates:")
    fill_rate_df = calculate_fill_rate(df)
    print(fill_rate_df)

    # Example column analysis (e.g., target column or specific feature)
    if 'TARGET' in df.columns:
        print("\nDistribution of TARGET variable:")
        plot_distribution(df, 'TARGET')

    # Skewness check (Example: a categorical column)
    if 'NAME_CONTRACT_TYPE' in df.columns:
        print("\nSkewness in NAME_CONTRACT_TYPE:")
        check_skewness(df, 'NAME_CONTRACT_TYPE')
--- Table 1: Analysis ---

Shape of DataFrame : (307511, 122)

Fill Rates:
                    Column Name  Fill Rate (%)
0                    SK_ID_CURR     100.000000
1                        TARGET     100.000000
2            NAME_CONTRACT_TYPE     100.000000
3                   CODE_GENDER     100.000000
4                  FLAG_OWN_CAR     100.000000
..                          ...            ...
117   AMT_REQ_CREDIT_BUREAU_DAY      86.498369
118  AMT_REQ_CREDIT_BUREAU_WEEK      86.498369
119   AMT_REQ_CREDIT_BUREAU_MON      86.498369
120   AMT_REQ_CREDIT_BUREAU_QRT      86.498369
121  AMT_REQ_CREDIT_BUREAU_YEAR      86.498369

[122 rows x 2 columns]

Distribution of TARGET variable:
[Figure: histogram, KDE, and box plots of the TARGET variable]
Skewness in NAME_CONTRACT_TYPE:
NAME_CONTRACT_TYPE
Cash loans         278232
Revolving loans     29279
Name: count, dtype: int64
NAME_CONTRACT_TYPE
Cash loans         90.478715
Revolving loans     9.521285
Name: proportion, dtype: float64

--- Table 2: Analysis ---

Shape of DataFrame : (1716428, 17)

Fill Rates:
               Column Name  Fill Rate (%)
0               SK_ID_CURR     100.000000
1             SK_ID_BUREAU     100.000000
2            CREDIT_ACTIVE     100.000000
3          CREDIT_CURRENCY     100.000000
4              DAYS_CREDIT     100.000000
5       CREDIT_DAY_OVERDUE     100.000000
6      DAYS_CREDIT_ENDDATE      93.850427
7        DAYS_ENDDATE_FACT      63.083042
8   AMT_CREDIT_MAX_OVERDUE      34.486736
9       CNT_CREDIT_PROLONG     100.000000
10          AMT_CREDIT_SUM      99.999243
11     AMT_CREDIT_SUM_DEBT      84.988068
12    AMT_CREDIT_SUM_LIMIT      65.522585
13  AMT_CREDIT_SUM_OVERDUE     100.000000
14             CREDIT_TYPE     100.000000
15      DAYS_CREDIT_UPDATE     100.000000
16             AMT_ANNUITY      28.526510

--- Table 3: Analysis ---

Shape of DataFrame : (27299925, 3)

Fill Rates:
      Column Name  Fill Rate (%)
0    SK_ID_BUREAU          100.0
1  MONTHS_BALANCE          100.0
2          STATUS          100.0

--- Table 4: Analysis ---

Shape of DataFrame : (3840312, 23)

Fill Rates:
                   Column Name  Fill Rate (%)
0                   SK_ID_PREV     100.000000
1                   SK_ID_CURR     100.000000
2               MONTHS_BALANCE     100.000000
3                  AMT_BALANCE     100.000000
4      AMT_CREDIT_LIMIT_ACTUAL     100.000000
5     AMT_DRAWINGS_ATM_CURRENT      80.475128
6         AMT_DRAWINGS_CURRENT     100.000000
7   AMT_DRAWINGS_OTHER_CURRENT      80.475128
8     AMT_DRAWINGS_POS_CURRENT      80.475128
9      AMT_INST_MIN_REGULARITY      92.051792
10         AMT_PAYMENT_CURRENT      80.001937
11   AMT_PAYMENT_TOTAL_CURRENT     100.000000
12    AMT_RECEIVABLE_PRINCIPAL     100.000000
13               AMT_RECIVABLE     100.000000
14        AMT_TOTAL_RECEIVABLE     100.000000
15    CNT_DRAWINGS_ATM_CURRENT      80.475128
16        CNT_DRAWINGS_CURRENT     100.000000
17  CNT_DRAWINGS_OTHER_CURRENT      80.475128
18    CNT_DRAWINGS_POS_CURRENT      80.475128
19   CNT_INSTALMENT_MATURE_CUM      92.051792
20        NAME_CONTRACT_STATUS     100.000000
21                      SK_DPD     100.000000
22                  SK_DPD_DEF     100.000000

--- Table 5: Analysis ---

Shape of DataFrame : (13605401, 8)

Fill Rates:
              Column Name  Fill Rate (%)
0              SK_ID_PREV     100.000000
1              SK_ID_CURR     100.000000
2  NUM_INSTALMENT_VERSION     100.000000
3   NUM_INSTALMENT_NUMBER     100.000000
4         DAYS_INSTALMENT     100.000000
5      DAYS_ENTRY_PAYMENT      99.978648
6          AMT_INSTALMENT     100.000000
7             AMT_PAYMENT      99.978648

--- Table 6: Analysis ---

Shape of DataFrame : (10001358, 8)

Fill Rates:
             Column Name  Fill Rate (%)
0             SK_ID_PREV     100.000000
1             SK_ID_CURR     100.000000
2         MONTHS_BALANCE     100.000000
3         CNT_INSTALMENT      99.739325
4  CNT_INSTALMENT_FUTURE      99.739165
5   NAME_CONTRACT_STATUS     100.000000
6                 SK_DPD     100.000000
7             SK_DPD_DEF     100.000000

--- Table 7: Analysis ---

Shape of DataFrame : (1670214, 37)

Fill Rates:
                    Column Name  Fill Rate (%)
0                    SK_ID_PREV     100.000000
1                    SK_ID_CURR     100.000000
2            NAME_CONTRACT_TYPE     100.000000
3                   AMT_ANNUITY      77.713335
4               AMT_APPLICATION     100.000000
5                    AMT_CREDIT      99.999940
6              AMT_DOWN_PAYMENT      46.363520
7               AMT_GOODS_PRICE      76.918227
8    WEEKDAY_APPR_PROCESS_START     100.000000
9       HOUR_APPR_PROCESS_START     100.000000
10  FLAG_LAST_APPL_PER_CONTRACT     100.000000
11       NFLAG_LAST_APPL_IN_DAY     100.000000
12            RATE_DOWN_PAYMENT      46.363520
13        RATE_INTEREST_PRIMARY       0.356302
14     RATE_INTEREST_PRIVILEGED       0.356302
15       NAME_CASH_LOAN_PURPOSE     100.000000
16         NAME_CONTRACT_STATUS     100.000000
17                DAYS_DECISION     100.000000
18            NAME_PAYMENT_TYPE     100.000000
19           CODE_REJECT_REASON     100.000000
20              NAME_TYPE_SUITE      50.880246
21             NAME_CLIENT_TYPE     100.000000
22          NAME_GOODS_CATEGORY     100.000000
23               NAME_PORTFOLIO     100.000000
24            NAME_PRODUCT_TYPE     100.000000
25                 CHANNEL_TYPE     100.000000
26             SELLERPLACE_AREA     100.000000
27         NAME_SELLER_INDUSTRY     100.000000
28                  CNT_PAYMENT      77.713634
29             NAME_YIELD_GROUP     100.000000
30          PRODUCT_COMBINATION      99.979284
31           DAYS_FIRST_DRAWING      59.701871
32               DAYS_FIRST_DUE      59.701871
33    DAYS_LAST_DUE_1ST_VERSION      59.701871
34                DAYS_LAST_DUE      59.701871
35             DAYS_TERMINATION      59.701871
36    NFLAG_INSURED_ON_APPROVAL      59.701871

Skewness in NAME_CONTRACT_TYPE:
NAME_CONTRACT_TYPE
Cash loans         747553
Consumer loans     729151
Revolving loans    193164
XNA                   346
Name: count, dtype: int64
NAME_CONTRACT_TYPE
Cash loans         44.757917
Consumer loans     43.656142
Revolving loans    11.565225
XNA                 0.020716
Name: proportion, dtype: float64

This analysis provides a detailed overview of the datasets, highlighting their structure, fill rates, and distribution trends. For each table, the shape of the DataFrame (number of rows and columns) is reported, and fill rates for individual columns are calculated to identify areas with missing data. Key categorical variables, such as NAME_CONTRACT_TYPE, are analyzed for skewness, providing insights into data imbalances. This breakdown gives a focused understanding of data quality and distribution, which is essential for the subsequent preprocessing and modeling steps.
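The TARGET distribution examined above is heavily imbalanced. A minimal sketch of how that imbalance can be quantified, using synthetic counts in the same ballpark as application_train (roughly 8% of applicants are defaulters):

```python
import pandas as pd

# Synthetic stand-in for the TARGET column: ~8% defaulters, as in application_train
target = pd.Series([0] * 92 + [1] * 8, name="TARGET")

counts = target.value_counts()
proportions = target.value_counts(normalize=True) * 100
imbalance_ratio = counts.loc[0] / counts.loc[1]

print(proportions)
print(f"Imbalance ratio (non-default : default) = {imbalance_ratio:.1f} : 1")
```

An imbalance of this magnitude is why metrics such as ROC-AUC, rather than raw accuracy, are emphasized later in the modeling sections.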

Data Curation and Cleaning¶

In [ ]:
fill_rate_df_app_train = calculate_fill_rate(appl_train)
fill_rate_df_app_train
Out[ ]:
Column Name Fill Rate (%)
0 SK_ID_CURR 100.000000
1 TARGET 100.000000
2 NAME_CONTRACT_TYPE 100.000000
3 CODE_GENDER 100.000000
4 FLAG_OWN_CAR 100.000000
... ... ...
117 AMT_REQ_CREDIT_BUREAU_DAY 86.498369
118 AMT_REQ_CREDIT_BUREAU_WEEK 86.498369
119 AMT_REQ_CREDIT_BUREAU_MON 86.498369
120 AMT_REQ_CREDIT_BUREAU_QRT 86.498369
121 AMT_REQ_CREDIT_BUREAU_YEAR 86.498369

122 rows × 2 columns

In [ ]:
appl_train.drop_duplicates(inplace=True)

4.1 Data Imputation¶

The plan is to keep only the columns that have a fill rate above 50% but below 100% and then find a way to impute their missing values. For now, columns with a fill rate of 50% or less will simply be dropped.

In this code, I am only concerned with imputing the important variables such as AMT_ANNUITY, AMT_GOODS_PRICE, NAME_TYPE_SUITE, OCCUPATION_TYPE, and CNT_FAM_MEMBERS (handling them is part of my task list in the project). I plan to build machine learning models to estimate optimal values for the missing cases. Since the appl_train dataset has over 300,000 rows, building imputation models is feasible.

Two Possible Approaches for the Above¶

  1. Approach 1 - Even after dropping columns with <= 50% fill rate, there will still be many columns with a fill rate above 50% but below 100%. One possible way of dealing with these columns (for the imputation model) is simply to drop them as well; I can bring them back while building the main models. This gives us the freedom to use any machine learning model for imputation, including the likes of linear and logistic regression, which are very sensitive to missing data.

  2. Approach 2 - Keep all the columns intact and don't drop any (other than the <= 50% fill-rate columns). Proceed with the normal machine learning workflow, but restrict the choice to models that can handle missing data on their own, such as XGBoost.
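A minimal sketch of the core idea behind Approach 1, on synthetic data: fit a regression model on the rows where AMT_ANNUITY is present and use it to predict the missing entries. The feature names mirror the Home Credit schema, but the data, the model choice (a plain LinearRegression), and the 12 simulated gaps are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "AMT_CREDIT": rng.uniform(1e5, 1e6, n),
    "AMT_INCOME_TOTAL": rng.uniform(5e4, 3e5, n),
})
# Synthetic annuity: roughly proportional to the credit amount, plus noise
df["AMT_ANNUITY"] = df["AMT_CREDIT"] * 0.05 + rng.normal(0, 1000, n)
# Knock out a handful of values to simulate the 12 missing entries
df.loc[df.sample(12, random_state=0).index, "AMT_ANNUITY"] = np.nan

known = df[df["AMT_ANNUITY"].notnull()]
missing = df[df["AMT_ANNUITY"].isnull()]

# Train on complete rows only
model = LinearRegression()
model.fit(known[["AMT_CREDIT", "AMT_INCOME_TOTAL"]], known["AMT_ANNUITY"])

# Fill the gaps with model predictions instead of a global mean/median
df.loc[missing.index, "AMT_ANNUITY"] = model.predict(
    missing[["AMT_CREDIT", "AMT_INCOME_TOTAL"]]
)
print(df["AMT_ANNUITY"].isnull().sum())  # prints 0
```

The same pattern generalizes to categorical targets (e.g. OCCUPATION_TYPE) by swapping in a classifier.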

In [ ]:
# dropping columns with <= 50% fillrate

# make a list of all columns with <= 50% fillrate
low_fill_rate_columns = fill_rate_df_app_train[fill_rate_df_app_train['Fill Rate (%)'] <= 50.00]['Column Name'].tolist()
low_fill_rate_columns
Out[ ]:
['OWN_CAR_AGE',
 'EXT_SOURCE_1',
 'APARTMENTS_AVG',
 'BASEMENTAREA_AVG',
 'YEARS_BUILD_AVG',
 'COMMONAREA_AVG',
 'ELEVATORS_AVG',
 'ENTRANCES_AVG',
 'FLOORSMIN_AVG',
 'LANDAREA_AVG',
 'LIVINGAPARTMENTS_AVG',
 'LIVINGAREA_AVG',
 'NONLIVINGAPARTMENTS_AVG',
 'NONLIVINGAREA_AVG',
 'APARTMENTS_MODE',
 'BASEMENTAREA_MODE',
 'YEARS_BUILD_MODE',
 'COMMONAREA_MODE',
 'ELEVATORS_MODE',
 'ENTRANCES_MODE',
 'FLOORSMIN_MODE',
 'LANDAREA_MODE',
 'LIVINGAPARTMENTS_MODE',
 'LIVINGAREA_MODE',
 'NONLIVINGAPARTMENTS_MODE',
 'NONLIVINGAREA_MODE',
 'APARTMENTS_MEDI',
 'BASEMENTAREA_MEDI',
 'YEARS_BUILD_MEDI',
 'COMMONAREA_MEDI',
 'ELEVATORS_MEDI',
 'ENTRANCES_MEDI',
 'FLOORSMIN_MEDI',
 'LANDAREA_MEDI',
 'LIVINGAPARTMENTS_MEDI',
 'LIVINGAREA_MEDI',
 'NONLIVINGAPARTMENTS_MEDI',
 'NONLIVINGAREA_MEDI',
 'FONDKAPREMONT_MODE',
 'HOUSETYPE_MODE',
 'WALLSMATERIAL_MODE']
In [ ]:
# dropping these columns from the appl_train dataframe
appl_train_refined = appl_train.drop(columns = low_fill_rate_columns)
In [ ]:
appl_train_refined.head()
Out[ ]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 81 columns

4.2 EDA for the Imputation Model¶

My first imputation model is focused on the target variable AMT_ANNUITY.

AMT_ANNUITY represents the loan annuity for the particular loan application.

In [ ]:
pandas_df_shape(appl_train_refined)
Shape of DataFrame : (307511, 81)
In [ ]:
# how many null values are there for the target variable
null_count = appl_train_refined['AMT_ANNUITY'].isnull().sum()
print(f"Number of null values in 'AMT_ANNUITY': {null_count}")
Number of null values in 'AMT_ANNUITY': 12

So there are just 12 null values in a dataset of 300,000+ rows. Imputation for this may seem like overkill, but another analysis proved that this variable is of utmost importance for predicting defaults, hence we proceed with it.

Let us try to get an idea of the distribution of the target variable

In [ ]:
appl_train_refined['AMT_ANNUITY'].describe()
Out[ ]:
AMT_ANNUITY
count 307499.000000
mean 27108.573909
std 14493.737315
min 1615.500000
25% 16524.000000
50% 24903.000000
75% 34596.000000
max 258025.500000

In [ ]:
plot_distribution(appl_train_refined,'AMT_ANNUITY')
No description has been provided for this image

4.3 Inference about the Distribution of the Target Variable - AMT_ANNUITY¶

The variable AMT_ANNUITY is not normally distributed.

Histogram: The data shows a right-skewed (positively skewed) distribution. Most of the values are concentrated toward the lower end (closer to 0), with fewer occurrences of higher values, which is typical of skewness.

KDE Plot: The Kernel Density Estimate (KDE) confirms the right-skew, with a sharp peak around the lower values and a long tail extending toward higher values. This long tail suggests the presence of some higher outliers.

Box Plot: The box plot further highlights the skewness, as the median is closer to the lower quartile, and there are many outliers on the right-hand side (values much higher than the rest).

Overall, the distribution of AMT_ANNUITY is right-skewed, with most values concentrated in the lower range, and a few higher values (outliers) extending toward the upper range.
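
The visual impression of right-skew can also be quantified. A minimal sketch (assuming scipy is available; the lognormal sample below is a synthetic stand-in for AMT_ANNUITY, not the real data):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
# synthetic right-skewed sample, illustrative stand-in for AMT_ANNUITY
sample = rng.lognormal(mean=10, sigma=0.5, size=10_000)

# a positive skewness coefficient confirms a right (positive) skew
print(f"skewness: {skew(sample):.2f}")

# for right-skewed data the mean is pulled above the median by the long tail
print(f"mean > median: {sample.mean() > np.median(sample)}")
```

For the actual column, `skew(appl_train_refined['AMT_ANNUITY'].dropna())` would give the same diagnostic.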

In [ ]:
appl_train_refined.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 81 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   SK_ID_CURR                    307511 non-null  int64  
 1   TARGET                        307511 non-null  int64  
 2   NAME_CONTRACT_TYPE            307511 non-null  object 
 3   CODE_GENDER                   307511 non-null  object 
 4   FLAG_OWN_CAR                  307511 non-null  object 
 5   FLAG_OWN_REALTY               307511 non-null  object 
 6   CNT_CHILDREN                  307511 non-null  int64  
 7   AMT_INCOME_TOTAL              307511 non-null  float64
 8   AMT_CREDIT                    307511 non-null  float64
 9   AMT_ANNUITY                   307499 non-null  float64
 10  AMT_GOODS_PRICE               307233 non-null  float64
 11  NAME_TYPE_SUITE               306219 non-null  object 
 12  NAME_INCOME_TYPE              307511 non-null  object 
 13  NAME_EDUCATION_TYPE           307511 non-null  object 
 14  NAME_FAMILY_STATUS            307511 non-null  object 
 15  NAME_HOUSING_TYPE             307511 non-null  object 
 16  REGION_POPULATION_RELATIVE    307511 non-null  float64
 17  DAYS_BIRTH                    307511 non-null  int64  
 18  DAYS_EMPLOYED                 307511 non-null  int64  
 19  DAYS_REGISTRATION             307511 non-null  float64
 20  DAYS_ID_PUBLISH               307511 non-null  int64  
 21  FLAG_MOBIL                    307511 non-null  int64  
 22  FLAG_EMP_PHONE                307511 non-null  int64  
 23  FLAG_WORK_PHONE               307511 non-null  int64  
 24  FLAG_CONT_MOBILE              307511 non-null  int64  
 25  FLAG_PHONE                    307511 non-null  int64  
 26  FLAG_EMAIL                    307511 non-null  int64  
 27  OCCUPATION_TYPE               211120 non-null  object 
 28  CNT_FAM_MEMBERS               307509 non-null  float64
 29  REGION_RATING_CLIENT          307511 non-null  int64  
 30  REGION_RATING_CLIENT_W_CITY   307511 non-null  int64  
 31  WEEKDAY_APPR_PROCESS_START    307511 non-null  object 
 32  HOUR_APPR_PROCESS_START       307511 non-null  int64  
 33  REG_REGION_NOT_LIVE_REGION    307511 non-null  int64  
 34  REG_REGION_NOT_WORK_REGION    307511 non-null  int64  
 35  LIVE_REGION_NOT_WORK_REGION   307511 non-null  int64  
 36  REG_CITY_NOT_LIVE_CITY        307511 non-null  int64  
 37  REG_CITY_NOT_WORK_CITY        307511 non-null  int64  
 38  LIVE_CITY_NOT_WORK_CITY       307511 non-null  int64  
 39  ORGANIZATION_TYPE             307511 non-null  object 
 40  EXT_SOURCE_2                  306851 non-null  float64
 41  EXT_SOURCE_3                  246546 non-null  float64
 42  YEARS_BEGINEXPLUATATION_AVG   157504 non-null  float64
 43  FLOORSMAX_AVG                 154491 non-null  float64
 44  YEARS_BEGINEXPLUATATION_MODE  157504 non-null  float64
 45  FLOORSMAX_MODE                154491 non-null  float64
 46  YEARS_BEGINEXPLUATATION_MEDI  157504 non-null  float64
 47  FLOORSMAX_MEDI                154491 non-null  float64
 48  TOTALAREA_MODE                159080 non-null  float64
 49  EMERGENCYSTATE_MODE           161756 non-null  object 
 50  OBS_30_CNT_SOCIAL_CIRCLE      306490 non-null  float64
 51  DEF_30_CNT_SOCIAL_CIRCLE      306490 non-null  float64
 52  OBS_60_CNT_SOCIAL_CIRCLE      306490 non-null  float64
 53  DEF_60_CNT_SOCIAL_CIRCLE      306490 non-null  float64
 54  DAYS_LAST_PHONE_CHANGE        307510 non-null  float64
 55  FLAG_DOCUMENT_2               307511 non-null  int64  
 56  FLAG_DOCUMENT_3               307511 non-null  int64  
 57  FLAG_DOCUMENT_4               307511 non-null  int64  
 58  FLAG_DOCUMENT_5               307511 non-null  int64  
 59  FLAG_DOCUMENT_6               307511 non-null  int64  
 60  FLAG_DOCUMENT_7               307511 non-null  int64  
 61  FLAG_DOCUMENT_8               307511 non-null  int64  
 62  FLAG_DOCUMENT_9               307511 non-null  int64  
 63  FLAG_DOCUMENT_10              307511 non-null  int64  
 64  FLAG_DOCUMENT_11              307511 non-null  int64  
 65  FLAG_DOCUMENT_12              307511 non-null  int64  
 66  FLAG_DOCUMENT_13              307511 non-null  int64  
 67  FLAG_DOCUMENT_14              307511 non-null  int64  
 68  FLAG_DOCUMENT_15              307511 non-null  int64  
 69  FLAG_DOCUMENT_16              307511 non-null  int64  
 70  FLAG_DOCUMENT_17              307511 non-null  int64  
 71  FLAG_DOCUMENT_18              307511 non-null  int64  
 72  FLAG_DOCUMENT_19              307511 non-null  int64  
 73  FLAG_DOCUMENT_20              307511 non-null  int64  
 74  FLAG_DOCUMENT_21              307511 non-null  int64  
 75  AMT_REQ_CREDIT_BUREAU_HOUR    265992 non-null  float64
 76  AMT_REQ_CREDIT_BUREAU_DAY     265992 non-null  float64
 77  AMT_REQ_CREDIT_BUREAU_WEEK    265992 non-null  float64
 78  AMT_REQ_CREDIT_BUREAU_MON     265992 non-null  float64
 79  AMT_REQ_CREDIT_BUREAU_QRT     265992 non-null  float64
 80  AMT_REQ_CREDIT_BUREAU_YEAR    265992 non-null  float64
dtypes: float64(27), int64(41), object(13)
memory usage: 190.0+ MB
In [ ]:
# Identify all the categorical variables
categorical_columns = appl_train_refined.select_dtypes(include=['object', 'category']).columns.tolist()

# Identify numerical variables (usually of type 'int64' or 'float64')
numerical_columns = appl_train_refined.select_dtypes(include=['int64', 'float64']).columns.tolist()


print("Categorical Variables:", categorical_columns)
print("Numerical Variables:", numerical_columns)
Categorical Variables: ['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'EMERGENCYSTATE_MODE']
Numerical Variables: ['SK_ID_CURR', 'TARGET', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'YEARS_BEGINEXPLUATATION_AVG', 'FLOORSMAX_AVG', 'YEARS_BEGINEXPLUATATION_MODE', 'FLOORSMAX_MODE', 'YEARS_BEGINEXPLUATATION_MEDI', 'FLOORSMAX_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']
In [ ]:
# printing all the categorical columns
categorical_appl_train_refined = appl_train_refined[categorical_columns].copy()  # .copy() avoids SettingWithCopyWarning downstream
categorical_appl_train_refined.head()
Out[ ]:
NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE OCCUPATION_TYPE WEEKDAY_APPR_PROCESS_START ORGANIZATION_TYPE EMERGENCYSTATE_MODE
0 Cash loans M N Y Unaccompanied Working Secondary / secondary special Single / not married House / apartment Laborers WEDNESDAY Business Entity Type 3 No
1 Cash loans F N N Family State servant Higher education Married House / apartment Core staff MONDAY School No
2 Revolving loans M Y Y Unaccompanied Working Secondary / secondary special Single / not married House / apartment Laborers MONDAY Government NaN
3 Cash loans F N Y Unaccompanied Working Secondary / secondary special Civil marriage House / apartment Laborers WEDNESDAY Business Entity Type 3 NaN
4 Cash loans M N Y Unaccompanied Working Secondary / secondary special Single / not married House / apartment Core staff THURSDAY Religion NaN

4.4 Feature Engineering for the Imputation Model¶

Now let's do some basic feature engineering to identify important features and drop unhelpful ones.

Encoding Categorical Variables¶

Since the plan is to use a tree-based algorithm for the imputation model, label encoding is a reasonable choice over one-hot encoding: tree-based algorithms are not sensitive to the absolute magnitude of the encoded values.
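
An alternative NaN-preserving sketch (on a hypothetical toy column, reusing the OCCUPATION_TYPE name only for illustration): `pandas.factorize` encodes missing values as -1, which can then be mapped back to NaN in one vectorized step. Note that factorize assigns codes in order of appearance rather than LabelEncoder's sorted order, which is harmless for tree-based models for the reason given above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'OCCUPATION_TYPE': ['Laborers', 'Core staff', None, 'Laborers']})

# factorize maps each distinct value to an integer code; NaN/None becomes -1
codes, uniques = pd.factorize(df['OCCUPATION_TYPE'])

# map the -1 sentinel back to NaN so missingness is preserved
df['OCCUPATION_TYPE_encoded'] = np.where(codes == -1, np.nan, codes).astype(float)

print(df['OCCUPATION_TYPE_encoded'].tolist())  # [0.0, 1.0, nan, 0.0]
```

Operating on a `.copy()` of the categorical slice (or using this vectorized pattern) also sidesteps the SettingWithCopyWarning that a per-column loop on a view can trigger.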

In [ ]:
# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Loop through each specified column and apply label encoding
for column in categorical_columns:
    categorical_appl_train_refined[column + '_encoded'] = np.where(categorical_appl_train_refined[column].isnull(),
                                   np.nan, label_encoder.fit_transform(categorical_appl_train_refined[column]))


categorical_appl_train_refined.head(2)
<ipython-input-125-fc80b82bcd82>:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  categorical_appl_train_refined[column + '_encoded'] = np.where(categorical_appl_train_refined[column].isnull(),
Out[ ]:
NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE OCCUPATION_TYPE ... FLAG_OWN_REALTY_encoded NAME_TYPE_SUITE_encoded NAME_INCOME_TYPE_encoded NAME_EDUCATION_TYPE_encoded NAME_FAMILY_STATUS_encoded NAME_HOUSING_TYPE_encoded OCCUPATION_TYPE_encoded WEEKDAY_APPR_PROCESS_START_encoded ORGANIZATION_TYPE_encoded EMERGENCYSTATE_MODE_encoded
0 Cash loans M N Y Unaccompanied Working Secondary / secondary special Single / not married House / apartment Laborers ... 1.0 6.0 7.0 4.0 3.0 1.0 8.0 6.0 5.0 0.0
1 Cash loans F N N Family State servant Higher education Married House / apartment Core staff ... 0.0 1.0 4.0 1.0 1.0 1.0 3.0 1.0 39.0 0.0

2 rows × 26 columns

In [ ]:
# dropping non encoded columns
categorical_appl_train_refined.drop(columns = categorical_columns,inplace=True)
categorical_appl_train_refined.head(2)
<ipython-input-126-a9cb0059ca24>:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  categorical_appl_train_refined.drop(columns = categorical_columns,inplace=True)
Out[ ]:
NAME_CONTRACT_TYPE_encoded CODE_GENDER_encoded FLAG_OWN_CAR_encoded FLAG_OWN_REALTY_encoded NAME_TYPE_SUITE_encoded NAME_INCOME_TYPE_encoded NAME_EDUCATION_TYPE_encoded NAME_FAMILY_STATUS_encoded NAME_HOUSING_TYPE_encoded OCCUPATION_TYPE_encoded WEEKDAY_APPR_PROCESS_START_encoded ORGANIZATION_TYPE_encoded EMERGENCYSTATE_MODE_encoded
0 0.0 1.0 0.0 1.0 6.0 7.0 4.0 3.0 1.0 8.0 6.0 5.0 0.0
1 0.0 0.0 0.0 0.0 1.0 4.0 1.0 1.0 1.0 3.0 1.0 39.0 0.0
In [ ]:
# merging the encoded categorical variables in main df

appl_train_2 = pd.concat([appl_train_refined,categorical_appl_train_refined], axis = 1)
appl_train_2.head()
Out[ ]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... FLAG_OWN_REALTY_encoded NAME_TYPE_SUITE_encoded NAME_INCOME_TYPE_encoded NAME_EDUCATION_TYPE_encoded NAME_FAMILY_STATUS_encoded NAME_HOUSING_TYPE_encoded OCCUPATION_TYPE_encoded WEEKDAY_APPR_PROCESS_START_encoded ORGANIZATION_TYPE_encoded EMERGENCYSTATE_MODE_encoded
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... 1.0 6.0 7.0 4.0 3.0 1.0 8.0 6.0 5.0 0.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... 0.0 1.0 4.0 1.0 1.0 1.0 3.0 1.0 39.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 ... 1.0 6.0 7.0 4.0 3.0 1.0 8.0 1.0 11.0 NaN
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 ... 1.0 6.0 7.0 4.0 0.0 1.0 8.0 6.0 5.0 NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 ... 1.0 6.0 7.0 4.0 3.0 1.0 3.0 4.0 37.0 NaN

5 rows × 94 columns

In [ ]:
# drop non encoded categorical columns from appl_train_2
appl_train_2.drop(columns = categorical_columns,inplace=True)
appl_train_2.head()
Out[ ]:
SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED ... FLAG_OWN_REALTY_encoded NAME_TYPE_SUITE_encoded NAME_INCOME_TYPE_encoded NAME_EDUCATION_TYPE_encoded NAME_FAMILY_STATUS_encoded NAME_HOUSING_TYPE_encoded OCCUPATION_TYPE_encoded WEEKDAY_APPR_PROCESS_START_encoded ORGANIZATION_TYPE_encoded EMERGENCYSTATE_MODE_encoded
0 100002 1 0 202500.0 406597.5 24700.5 351000.0 0.018801 -9461 -637 ... 1.0 6.0 7.0 4.0 3.0 1.0 8.0 6.0 5.0 0.0
1 100003 0 0 270000.0 1293502.5 35698.5 1129500.0 0.003541 -16765 -1188 ... 0.0 1.0 4.0 1.0 1.0 1.0 3.0 1.0 39.0 0.0
2 100004 0 0 67500.0 135000.0 6750.0 135000.0 0.010032 -19046 -225 ... 1.0 6.0 7.0 4.0 3.0 1.0 8.0 1.0 11.0 NaN
3 100006 0 0 135000.0 312682.5 29686.5 297000.0 0.008019 -19005 -3039 ... 1.0 6.0 7.0 4.0 0.0 1.0 8.0 6.0 5.0 NaN
4 100007 0 0 121500.0 513000.0 21865.5 513000.0 0.028663 -19932 -3038 ... 1.0 6.0 7.0 4.0 3.0 1.0 3.0 4.0 37.0 NaN

5 rows × 81 columns

In [ ]:
#changing all the columns to float
appl_train_2 = appl_train_2.astype(float)

4.5 Correlation Matrix¶

Now that all variables have been brought to the same type (float), we will identify the correlations between variables through a correlation matrix. After this step, we can drop features that have a very weak correlation with AMT_ANNUITY and merge variables that are very strongly correlated with each other.

In [ ]:
# Calculate the correlation matrix
correlation_matrix = appl_train_2.corr()

# Create the heatmap
plt.figure(figsize=(20, 16))  # You can adjust the size depending on the number of features
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm', fmt=".2f", linewidths=0.5)

plt.title('Feature Correlation Heatmap')
plt.show()
No description has been provided for this image

It is very difficult to make any inference from the above heatmap as most values fall in the gray range. To get a better visualization, I want to filter out some of the weakly correlated values. I am also interested in the distribution of values within the correlation matrix itself.

In [ ]:
# Flatten the correlation matrix and filter out the 1.0 values (self-correlation)
correlation_values = correlation_matrix.values.flatten()
correlation_values = correlation_values[correlation_values != 1.0]  # Exclude self-correlations

plt.figure(figsize=(10, 6))
sns.histplot(correlation_values, bins=20, kde=True)
plt.title('Distribution of Correlation Coefficients')
plt.xlabel('Correlation Coefficient')
plt.ylabel('Frequency')
plt.grid()
plt.show()
No description has been provided for this image

From the graph, 0.05 seems to be a reasonable cutoff. Therefore, I am dropping all columns from the dataset whose absolute correlation with AMT_ANNUITY is <= 0.05.

In [ ]:
correlation_with_amt_annuity = correlation_matrix['AMT_ANNUITY']

# Identify columns to drop (correlation <= 0.05)
columns_to_drop = correlation_with_amt_annuity[abs(correlation_with_amt_annuity) <= 0.05].index.tolist()

# Drop those columns from the DataFrame
appl_train_3 = appl_train_2.drop(columns=columns_to_drop)
In [ ]:
appl_train_3.head()
Out[ ]:
AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_EMPLOYED FLAG_EMP_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT ... FLAG_DOCUMENT_3 FLAG_DOCUMENT_6 FLAG_DOCUMENT_8 NAME_CONTRACT_TYPE_encoded CODE_GENDER_encoded FLAG_OWN_CAR_encoded NAME_INCOME_TYPE_encoded NAME_EDUCATION_TYPE_encoded NAME_FAMILY_STATUS_encoded ORGANIZATION_TYPE_encoded
0 202500.0 406597.5 24700.5 351000.0 0.018801 -637.0 1.0 0.0 1.0 2.0 ... 1.0 0.0 0.0 0.0 1.0 0.0 7.0 4.0 3.0 5.0
1 270000.0 1293502.5 35698.5 1129500.0 0.003541 -1188.0 1.0 0.0 2.0 1.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 4.0 1.0 1.0 39.0
2 67500.0 135000.0 6750.0 135000.0 0.010032 -225.0 1.0 0.0 1.0 2.0 ... 0.0 0.0 0.0 1.0 1.0 1.0 7.0 4.0 3.0 11.0
3 135000.0 312682.5 29686.5 297000.0 0.008019 -3039.0 1.0 0.0 2.0 2.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 7.0 4.0 0.0 5.0
4 121500.0 513000.0 21865.5 513000.0 0.028663 -3038.0 1.0 0.0 1.0 2.0 ... 0.0 0.0 1.0 0.0 1.0 0.0 7.0 4.0 3.0 37.0

5 rows × 30 columns

In [ ]:
appl_train_3.shape
Out[ ]:
(307511, 30)

We have reduced the dataset drastically, from 81 columns down to just 30. I am not sure whether the 0.05 threshold was a good choice, but we will see.

In [ ]:
# Calculate the correlation matrix
correlation_matrix2 = appl_train_3.corr()

# Create the heatmap
plt.figure(figsize=(16, 16))  # You can adjust the size depending on the number of features
sns.heatmap(correlation_matrix2, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)

plt.title('Feature Correlation Heatmap')
plt.show()
No description has been provided for this image

Observations and Inferences from the Correlation Matrix¶

  1. The two features showing the strongest correlation with our target AMT_ANNUITY are AMT_CREDIT and AMT_GOODS_PRICE, with correlation values of 0.77 and 0.88 respectively.

    • AMT_CREDIT - Credit amount of the loan
    • AMT_GOODS_PRICE - For consumer loans it is the price of the goods for which the loan is given
  2. However, when we check the correlation of these two variables against each other, we find that it is 0.99.

  3. This implies that keeping both variables in the training data could introduce a collinearity problem.

  4. Hence I am making the decision to merge these two features into a single one. We will explore how to merge them sensibly in the following cells.

  5. We see similar problems of very strong correlation in two other groups of variables -

    • group 1 - (FLOORSMAX_AVG, FLOORSMAX_MODE, FLOORSMAX_MEDI) - These are all measures of central tendency of the same quantity, so there is no point merging them; instead we will keep one and drop the other two.

    • group 2 - (REGION_RATING_CLIENT and REGION_RATING_CLIENT_W_CITY) - These are two different ways of representing the same information, i.e. the rating (1, 2, 3) of the region where the client lives. We will keep REGION_RATING_CLIENT_W_CITY and drop the other, since it accounts for the client's city as well, not just the region, and is therefore more precise.

That is about all; now we can make the above changes and proceed with data preparation.
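
The manual pair-hunting above can also be done programmatically. A minimal sketch on a hypothetical toy frame (the Home Credit column names are reused here for illustration only, with synthetic values) that flags any feature pair whose absolute correlation exceeds 0.9:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
credit = rng.uniform(45_000, 4_050_000, size=1_000)
df = pd.DataFrame({
    'AMT_CREDIT': credit,
    'AMT_GOODS_PRICE': credit * rng.uniform(0.8, 0.9, size=1_000),  # near-duplicate of AMT_CREDIT
    'AMT_INCOME_TOTAL': rng.uniform(30_000, 500_000, size=1_000),   # independent column
})

corr = df.corr().abs()
# keep only the upper triangle so each pair is counted once (lower triangle -> NaN)
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

pairs = [(a, b, round(upper.loc[a, b], 2))
         for a in upper.index for b in upper.columns
         if upper.loc[a, b] > 0.9]
print(pairs)  # expect the AMT_CREDIT / AMT_GOODS_PRICE pair to be flagged
```

Each flagged pair is then a candidate for merging (as with the credit-to-goods-price ratio) or for dropping one member (as with the FLOORSMAX and REGION_RATING groups).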

In [ ]:
# merging the columns AMT_CREDIT and AMT_GOODS_PRICE

# First explore the summary statistics of each
print(appl_train_3[['AMT_CREDIT','AMT_GOODS_PRICE']].describe())

#checking for null values
print("Null value count in AMT_CREDIT : ", appl_train_3['AMT_CREDIT'].isnull().sum())
print("Null value count in AMT_GOODS_PRICE : ",appl_train_3['AMT_GOODS_PRICE'].isnull().sum())
         AMT_CREDIT  AMT_GOODS_PRICE
count  3.075110e+05     3.072330e+05
mean   5.990260e+05     5.383962e+05
std    4.024908e+05     3.694465e+05
min    4.500000e+04     4.050000e+04
25%    2.700000e+05     2.385000e+05
50%    5.135310e+05     4.500000e+05
75%    8.086500e+05     6.795000e+05
max    4.050000e+06     4.050000e+06
Null value count in AMT_CREDIT :  0
Null value count in AMT_GOODS_PRICE :  278
In [ ]:
# check for the distribution of the 2 variables
plot_distribution(appl_train_refined,'AMT_CREDIT')
plot_distribution(appl_train_refined,'AMT_GOODS_PRICE')
No description has been provided for this image
No description has been provided for this image

I have decided to take the ratio of AMT_CREDIT to AMT_GOODS_PRICE as the merged variable

In [ ]:
# Merging and creating the variable RATIO_AMT_CREDIT_TO_GOODS_PRICE

# Impute null values in AMT_GOODS_PRICE with its median
median_goods_price = appl_train_3['AMT_GOODS_PRICE'].median()
appl_train_3['AMT_GOODS_PRICE'] = appl_train_3['AMT_GOODS_PRICE'].fillna(median_goods_price)

# Calculate the ratio
appl_train_3['RATIO_AMT_CREDIT_TO_GOODS_PRICE'] = (
    appl_train_3['AMT_CREDIT'] / appl_train_3['AMT_GOODS_PRICE']
)

# Round the ratio to 2 decimal places
appl_train_3['RATIO_AMT_CREDIT_TO_GOODS_PRICE'] = (
    appl_train_3['RATIO_AMT_CREDIT_TO_GOODS_PRICE'].round(2)
)

appl_train_3.head(2)
Out[ ]:
AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_EMPLOYED FLAG_EMP_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT ... FLAG_DOCUMENT_6 FLAG_DOCUMENT_8 NAME_CONTRACT_TYPE_encoded CODE_GENDER_encoded FLAG_OWN_CAR_encoded NAME_INCOME_TYPE_encoded NAME_EDUCATION_TYPE_encoded NAME_FAMILY_STATUS_encoded ORGANIZATION_TYPE_encoded RATIO_AMT_CREDIT_TO_GOODS_PRICE
0 202500.0 406597.5 24700.5 351000.0 0.018801 -637.0 1.0 0.0 1.0 2.0 ... 0.0 0.0 0.0 1.0 0.0 7.0 4.0 3.0 5.0 1.16
1 270000.0 1293502.5 35698.5 1129500.0 0.003541 -1188.0 1.0 0.0 2.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 4.0 1.0 1.0 39.0 1.15

2 rows × 31 columns

Now let's take care of the other two groups of features with collinearity problems.

In [ ]:
# dropping columns FLOORSMAX_MODE, FLOORSMAX_MEDI and keeping only FLOORSMAX_AVG
# dropping column REGION_RATING_CLIENT
# dropping columns AMT_CREDIT and AMT_GOODS_PRICE

appl_train_4 = appl_train_3.drop(columns = ['AMT_CREDIT','AMT_GOODS_PRICE','FLOORSMAX_MODE','FLOORSMAX_MEDI','REGION_RATING_CLIENT'])
appl_train_4.head(2)
Out[ ]:
AMT_INCOME_TOTAL AMT_ANNUITY REGION_POPULATION_RELATIVE DAYS_EMPLOYED FLAG_EMP_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_WORK_REGION ... FLAG_DOCUMENT_6 FLAG_DOCUMENT_8 NAME_CONTRACT_TYPE_encoded CODE_GENDER_encoded FLAG_OWN_CAR_encoded NAME_INCOME_TYPE_encoded NAME_EDUCATION_TYPE_encoded NAME_FAMILY_STATUS_encoded ORGANIZATION_TYPE_encoded RATIO_AMT_CREDIT_TO_GOODS_PRICE
0 202500.0 24700.5 0.018801 -637.0 1.0 0.0 1.0 2.0 10.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 7.0 4.0 3.0 5.0 1.16
1 270000.0 35698.5 0.003541 -1188.0 1.0 0.0 2.0 1.0 11.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 4.0 1.0 1.0 39.0 1.15

2 rows × 26 columns

4.6 Feature Scaling and Normalization for the Imputation Model¶

Now that we have decided our final set of features, let's scale the column AMT_INCOME_TOTAL.

In [ ]:
mmscaler = MinMaxScaler()

# Scale the specified columns
appl_train_4[['AMT_INCOME_TOTAL']] = mmscaler.fit_transform(
    appl_train_4[['AMT_INCOME_TOTAL']]
)

appl_train_4.head(2)
Out[ ]:
AMT_INCOME_TOTAL AMT_ANNUITY REGION_POPULATION_RELATIVE DAYS_EMPLOYED FLAG_EMP_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_WORK_REGION ... FLAG_DOCUMENT_6 FLAG_DOCUMENT_8 NAME_CONTRACT_TYPE_encoded CODE_GENDER_encoded FLAG_OWN_CAR_encoded NAME_INCOME_TYPE_encoded NAME_EDUCATION_TYPE_encoded NAME_FAMILY_STATUS_encoded ORGANIZATION_TYPE_encoded RATIO_AMT_CREDIT_TO_GOODS_PRICE
0 0.001512 24700.5 0.018801 -637.0 1.0 0.0 1.0 2.0 10.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 7.0 4.0 3.0 5.0 1.16
1 0.002089 35698.5 0.003541 -1188.0 1.0 0.0 2.0 1.0 11.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 4.0 1.0 1.0 39.0 1.15

2 rows × 26 columns

In [ ]:
# dealing with DAYS_EMPLOYED and DAYS_LAST_PHONE_CHANGE

# Make the day-count columns positive
appl_train_4['DAYS_EMPLOYED'] = appl_train_4['DAYS_EMPLOYED'].abs()
appl_train_4['DAYS_LAST_PHONE_CHANGE'] = appl_train_4['DAYS_LAST_PHONE_CHANGE'].abs()

# apply MinMax scaling
appl_train_4['DAYS_EMPLOYED'] = mmscaler.fit_transform(appl_train_4[['DAYS_EMPLOYED']])
appl_train_4['DAYS_LAST_PHONE_CHANGE'] = mmscaler.fit_transform(appl_train_4[['DAYS_LAST_PHONE_CHANGE']])

appl_train_4.head(2)
Out[ ]:
AMT_INCOME_TOTAL AMT_ANNUITY REGION_POPULATION_RELATIVE DAYS_EMPLOYED FLAG_EMP_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_WORK_REGION ... FLAG_DOCUMENT_6 FLAG_DOCUMENT_8 NAME_CONTRACT_TYPE_encoded CODE_GENDER_encoded FLAG_OWN_CAR_encoded NAME_INCOME_TYPE_encoded NAME_EDUCATION_TYPE_encoded NAME_FAMILY_STATUS_encoded ORGANIZATION_TYPE_encoded RATIO_AMT_CREDIT_TO_GOODS_PRICE
0 0.001512 24700.5 0.018801 0.001744 1.0 0.0 1.0 2.0 10.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 7.0 4.0 3.0 5.0 1.16
1 0.002089 35698.5 0.003541 0.003253 1.0 0.0 2.0 1.0 11.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 4.0 1.0 1.0 39.0 1.15

2 rows × 26 columns

In [ ]:
# Dropping the rows where AMT_ANNUITY is null and storing them in a separate df

final_rows_for_imputation = appl_train_4[appl_train_4['AMT_ANNUITY'].isnull()]
appl_train_4 = appl_train_4.dropna(subset=['AMT_ANNUITY'])
In [ ]:
final_rows_for_imputation
Out[ ]:
AMT_INCOME_TOTAL AMT_ANNUITY REGION_POPULATION_RELATIVE DAYS_EMPLOYED FLAG_EMP_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_WORK_REGION ... FLAG_DOCUMENT_6 FLAG_DOCUMENT_8 NAME_CONTRACT_TYPE_encoded CODE_GENDER_encoded FLAG_OWN_CAR_encoded NAME_INCOME_TYPE_encoded NAME_EDUCATION_TYPE_encoded NAME_FAMILY_STATUS_encoded ORGANIZATION_TYPE_encoded RATIO_AMT_CREDIT_TO_GOODS_PRICE
47531 0.001320 NaN 0.026392 0.006908 1.0 0.0 1.0 2.0 13.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 1.0 2.0 3.0 5.0 1.00
50035 0.000589 NaN 0.035792 0.003477 1.0 0.0 2.0 2.0 20.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 7.0 3.0 0.0 3.0 1.00
51594 0.001512 NaN 0.046220 0.010815 1.0 0.0 2.0 1.0 15.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 7.0 4.0 1.0 42.0 1.20
55025 0.001166 NaN 0.035792 0.005585 1.0 0.0 2.0 2.0 13.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 4.0 1.0 1.0 11.0 1.32
59934 0.001512 NaN 0.046220 0.006839 1.0 0.0 1.0 1.0 15.0 1.0 ... 0.0 1.0 0.0 1.0 1.0 7.0 4.0 3.0 33.0 1.00
75873 0.001012 NaN 0.022800 0.006708 1.0 0.0 1.0 2.0 15.0 0.0 ... 0.0 0.0 0.0 1.0 1.0 7.0 1.0 3.0 5.0 1.32
89343 0.000550 NaN 0.015221 0.010188 1.0 0.0 2.0 2.0 11.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 4.0 4.0 1.0 30.0 1.00
123872 0.001512 NaN 0.019689 0.009692 1.0 1.0 2.0 2.0 12.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 7.0 4.0 0.0 42.0 1.29
207186 0.001243 NaN 0.018634 0.001292 1.0 0.0 2.0 2.0 13.0 1.0 ... 0.0 0.0 0.0 1.0 0.0 1.0 1.0 1.0 40.0 1.00
227939 0.002474 NaN 0.046220 0.004047 1.0 0.0 2.0 1.0 11.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 5.0 1.26
239329 0.001127 NaN 0.026392 0.007899 1.0 0.0 2.0 2.0 14.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 4.0 4.0 1.0 33.0 1.00
241835 0.002474 NaN 0.072508 0.000416 1.0 0.0 2.0 1.0 15.0 1.0 ... 0.0 1.0 0.0 0.0 0.0 7.0 1.0 1.0 42.0 1.10

12 rows × 26 columns

In [ ]:
# converting the dataframe to numpy arrays
X = np.array(appl_train_4.drop('AMT_ANNUITY', axis=1))
y = np.array(appl_train_4['AMT_ANNUITY'])
In [ ]:
print(X[0:3,:])
print(X.shape)
[[1.51186991e-03 1.88010000e-02 1.74404438e-03 1.00000000e+00
  0.00000000e+00 1.00000000e+00 2.00000000e+00 1.00000000e+01
  0.00000000e+00 0.00000000e+00 2.62948593e-01 8.33000000e-02
  1.49000000e-02 2.64212488e-01 1.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 1.00000000e+00 0.00000000e+00
  7.00000000e+00 4.00000000e+00 3.00000000e+00 5.00000000e+00
  1.16000000e+00]
 [2.08891949e-03 3.54100000e-03 3.25262907e-03 1.00000000e+00
  0.00000000e+00 2.00000000e+00 1.00000000e+00 1.10000000e+01
  0.00000000e+00 0.00000000e+00 6.22245775e-01 2.91700000e-01
  7.14000000e-02 1.92917055e-01 1.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  4.00000000e+00 1.00000000e+00 1.00000000e+00 3.90000000e+01
  1.15000000e+00]
 [3.57770742e-04 1.00320000e-02 6.16028233e-04 1.00000000e+00
  0.00000000e+00 1.00000000e+00 2.00000000e+00 9.00000000e+00
  0.00000000e+00 0.00000000e+00 5.55912083e-01            nan
             nan 1.89888164e-01 0.00000000e+00 0.00000000e+00
  0.00000000e+00 1.00000000e+00 1.00000000e+00 1.00000000e+00
  7.00000000e+00 4.00000000e+00 3.00000000e+00 1.10000000e+01
  1.00000000e+00]]
(307499, 25)
In [ ]:
print(y[0:3,])
print(y.shape)
[24700.5 35698.5  6750. ]
(307499,)
In [ ]:
# Train-test split (e.g., 80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=34)

print("Training Features:\n", X_train.shape)
print("Training Target:\n", y_train.shape)
print("Testing Features:\n", X_test.shape)
print("Testing Target:\n", y_test.shape)
Training Features:
 (245999, 25)
Training Target:
 (245999,)
Testing Features:
 (61500, 25)
Testing Target:
 (61500,)

4.7 LightGBM Model Training with 5-Fold Cross-Validation¶

I plan on training the data in the following way:

  • As shown above, I have already set aside 20% of the data for testing.
  • The remaining 80% (the training data) will undergo the 5-fold cross-validation process.
  • From that process we obtain the best RMSE (my evaluation metric for this model).
  • Finally, we will build the model on the whole training data and make predictions on the held-out test set.
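The workflow above can be sketched generically. The following is a minimal stand-in on synthetic data using scikit-learn's `KFold` and a linear model (the notebook itself uses LightGBM via `lgbm.cv`; the data and model here are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, train_test_split

# Synthetic stand-in data; in the notebook X, y come from appl_train_4.
rng = np.random.default_rng(34)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)

# Step 1: hold out 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=34)

# Step 2: 5-fold CV on the remaining 80% to estimate RMSE.
fold_rmses = []
for tr_idx, va_idx in KFold(n_splits=5, shuffle=True, random_state=34).split(X_train):
    model = LinearRegression().fit(X_train[tr_idx], y_train[tr_idx])
    pred = model.predict(X_train[va_idx])
    fold_rmses.append(mean_squared_error(y_train[va_idx], pred) ** 0.5)
cv_rmse = float(np.mean(fold_rmses))

# Step 3: refit on the whole training split and score on the held-out test set.
final_model = LinearRegression().fit(X_train, y_train)
test_rmse = mean_squared_error(y_test, final_model.predict(X_test)) ** 0.5
```

The CV estimate tells us roughly what test RMSE to expect before we ever touch the held-out 20%.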
In [ ]:
# Create LightGBM dataset
train_data = lgbm.Dataset(X_train, label=y_train)
test_data = lgbm.Dataset(X_test, label=y_test, reference=train_data)
In [ ]:
# Set parameters for the LightGBM model
params = {
    'objective': 'regression',
    'metric': 'rmse',
    'boosting_type': 'gbdt',
    'learning_rate': 0.03,
    'num_leaves': 30,
    'verbose': 1
}

# Perform K-Fold Cross-Validation
cv_results = lgbm.cv(
    params,
    train_data,
    num_boost_round=1000,       # Number of boosting rounds
    nfold=5,                   # Number of folds
    metrics='rmse',            # Evaluation metric
    stratified=False
)

cv_score = min(cv_results['valid rmse-mean'])

# Print the CV score
print("CV RMSE Score: ", cv_score)
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.130169 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1654
[LightGBM] [Info] Number of data points in the train set: 196796, number of used features: 25
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.095617 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1654
[LightGBM] [Info] Number of data points in the train set: 196796, number of used features: 25
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.090039 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1654
[LightGBM] [Info] Number of data points in the train set: 196796, number of used features: 25
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.200164 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1654
[LightGBM] [Info] Number of data points in the train set: 196796, number of used features: 25
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.090620 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1654
[LightGBM] [Info] Number of data points in the train set: 196796, number of used features: 25
[LightGBM] [Info] Start training from score 27104.863496
[LightGBM] [Info] Start training from score 27100.001339
[LightGBM] [Info] Start training from score 27152.839776
[LightGBM] [Info] Start training from score 27121.466775
[LightGBM] [Info] Start training from score 27122.220380
CV RMSE Score:  10107.937051493398
In [ ]:
# Print cross-validation results
print("Best RMSE: ", min(cv_results['valid rmse-mean']))
Best RMSE:  10107.937051493398
In [ ]:
# Train the final model on the full training set, using the boosting rounds chosen via 5-fold CV

lgbm_model = lgbm.train(
    params,
    train_data,
    num_boost_round=1000,
    valid_sets=[test_data]
)
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.053520 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1654
[LightGBM] [Info] Number of data points in the train set: 245999, number of used features: 25
[LightGBM] [Info] Start training from score 27120.115110
In [ ]:
# Predict on the test set
y_pred = lgbm_model.predict(X_test, num_iteration=lgbm_model.best_iteration)

# Evaluate the model (root_mean_squared_error replaces the deprecated squared=False)
from sklearn.metrics import root_mean_squared_error
rmse = root_mean_squared_error(y_test, y_pred)
print("Test RMSE: ", rmse)
Test RMSE:  10117.617427842439
In [ ]:
# Convert y_test and y_pred to pandas Series
y_test_series = pd.Series(y_test).reset_index(drop=True)
y_pred_series = pd.Series(y_pred)

# Line Plot for Actual vs Predicted
plt.figure(figsize=(10, 6))
plt.plot(y_test_series, label='Actual Values', color='blue', linewidth=2)
plt.plot(y_pred_series, label='Predicted Values', color='orange', linestyle='--', linewidth=2)
plt.title('Actual vs Predicted Values')
plt.xlabel('Sample Index')
plt.ylabel('Values')
plt.legend()
plt.grid()
plt.show()
No description has been provided for this image
In [ ]:
# Now using the above model to impute the 12 missing values

X_missing = np.array(final_rows_for_imputation.drop('AMT_ANNUITY', axis=1))
y_missing = np.array(final_rows_for_imputation['AMT_ANNUITY'])
In [ ]:
# predict missing values
y_pred_missing = lgbm_model.predict(X_missing, num_iteration=lgbm_model.best_iteration)
In [ ]:
y_pred_missing
Out[ ]:
array([25772.46449965, 21599.84301145, 29199.5341337 , 26290.14381451,
       30433.70880488, 22291.982206  , 22188.38300024, 35438.11132234,
       27516.3155974 , 29800.85762735, 29278.1063208 , 55642.9651374 ])

Now we can use the imputed values in the training of the main model for credit risk default prediction
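One way to fold the predictions back in is to write them into the held-out rows and concatenate the frames. This is a sketch on toy frames; the variable names mirror the notebook's (`appl_train_4`, `final_rows_for_imputation`, `y_pred_missing`), but the values and indices are illustrative:

```python
import numpy as np
import pandas as pd

# Toy stand-ins: the remainder of the training frame, plus two rows whose
# AMT_ANNUITY was null and has now been predicted by the imputation model.
appl_train_4 = pd.DataFrame(
    {"AMT_INCOME_TOTAL": [0.0015, 0.0021], "AMT_ANNUITY": [24700.5, 35698.5]}
)
final_rows_for_imputation = pd.DataFrame(
    {"AMT_INCOME_TOTAL": [0.0013, 0.0006], "AMT_ANNUITY": [np.nan, np.nan]},
    index=[47531, 50035],
)
y_pred_missing = np.array([25772.46, 21599.84])

# Write the model's predictions into the held-out rows, then stack the
# frames back together so the main model sees a complete AMT_ANNUITY column.
final_rows_for_imputation["AMT_ANNUITY"] = y_pred_missing
appl_train_full = pd.concat([appl_train_4, final_rows_for_imputation]).sort_index()
```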

4.8 Imputation of Other Variables by a Different Method¶

Here, instead of building imputation models, we have opted for an analytical rule-based approach: we impute each missing column with the most appropriate central tendency metric (mean, median, or mode).

In [ ]:
## Imputation functions, from start to end
# Check Missing Values
def check_missing_values_p(df: pd.DataFrame, column_name: str) -> int:
    """
    Function to count missing values (NaN) in a specified column in a Pandas DataFrame.

    Parameters:
        df : pd.DataFrame
            The input Pandas DataFrame.
        column_name : str
            The name of the column to check for missing values.

    Returns:
        int: Number of missing values in the column.
    """
    missing_values = df[column_name].isna().sum()
    print(f"Missing values in {column_name}: {missing_values}")
    return missing_values

# Check Distribution of Column
def check_distribution_p(df: pd.DataFrame, column_name: str) -> dict:
    """
    Function to display basic statistics (mean, stddev, min, max) of a specified column in a Pandas DataFrame.

    Parameters:
        df : pd.DataFrame
            The input Pandas DataFrame.
        column_name : str
            The name of the column to check the distribution.

    Returns:
        dict: A dictionary containing count, mean, stddev, min, and max values.
    """
    summary = df[column_name].describe()
    summary_dict = {
        'count': summary['count'],
        'mean': summary['mean'],
        'stddev': summary['std'] if 'std' in summary.index else None,
        'min': summary['min'],
        'max': summary['max']
    }

    print(summary_dict)
    return summary_dict

# Choose Imputation Method
def choose_imputation_method_p(count, mean, stddev, min_val, max_val) -> str:
    """
    Function to suggest the best imputation method (mean, median, or mode)
    based on the summary statistics of a column.

    Parameters:
        count (int): Total count of non-null values in the column.
        mean (float): Mean of the column values.
        stddev (float): Standard deviation of the column values.
        min_val (float): Minimum value in the column.
        max_val (float): Maximum value in the column.

    Returns:
        str: Recommended imputation method ('mean', 'median', or 'mode').
    """
    # If there's a small standard deviation, mean imputation is okay
    if stddev <= mean:
        return "mean"

    # If standard deviation is high, there's likely more variation or outliers
    if stddev > mean:
        # Mode is useful if the minimum value is 0 and data is skewed (potentially binary or categorical)
        if min_val == 0 and max_val / mean > 5:
            return "mode"
        else:
            return "median"

    # Default to median if no strong preference can be inferred
    return "median"


# Impute Missing Values with Mean, Median, Mode
def impute_method_p(df: pd.DataFrame, column_name: str, method: str) -> pd.DataFrame:
    """
    Function to impute missing values in a column with the specified method (mean, median, mode).

    Parameters:
        df : pd.DataFrame
            The input Pandas DataFrame.
        column_name : str
            The name of the column to impute missing values for.
        method : str
            The imputation method ('mean', 'median', 'mode').

    Returns:
        pd.DataFrame: DataFrame with missing values imputed.
    """
    if method == 'mean':
        mean_value = df[column_name].mean()
        print(f"Mean for {column_name}: {mean_value}")
        df[column_name] = df[column_name].fillna(mean_value)

    elif method == 'median':
        median_value = df[column_name].median()
        print(f"Median for {column_name}: {median_value}")
        df[column_name] = df[column_name].fillna(median_value)

    elif method == 'mode':
        mode_value = df[column_name].mode()[0]
        print(f"Mode for {column_name}: {mode_value}")
        df[column_name] = df[column_name].fillna(mode_value)

    else:
        print("Invalid method")

    return df

# Verify Imputation
def verify_imputation_p(df: pd.DataFrame, column_name: str) -> int:
    """
    Function to verify if all missing values in a column have been imputed.

    Parameters:
        df : pd.DataFrame
            The input Pandas DataFrame.
        column_name : str
            The name of the column to verify imputation for.

    Returns:
        int: Number of remaining missing values.
    """
    missing_values = df[column_name].isna().sum()
    print(f"Remaining missing values in {column_name}: {missing_values}")
    return missing_values
In [ ]:
column='DEF_30_CNT_SOCIAL_CIRCLE'
In [ ]:
missing_count = check_missing_values_p(appl_train, column)
Missing values in DEF_30_CNT_SOCIAL_CIRCLE: 1021
In [ ]:
summary_stats = check_distribution_p(appl_train,column)
{'count': 306490.0, 'mean': 0.1434206662533851, 'stddev': 0.4466984293825373, 'min': 0.0, 'max': 34.0}
In [ ]:
# Choose imputation method
method = choose_imputation_method_p(
    count=summary_stats['count'],
    mean=summary_stats['mean'],
    stddev=summary_stats['stddev'],
    min_val=summary_stats['min'],
    max_val=summary_stats['max']
)
print(method)
mode
In [ ]:
appl_train = impute_method_p(appl_train, column, method)
Mode for DEF_30_CNT_SOCIAL_CIRCLE: 0.0
In [ ]:
verify_imputation_p(appl_train, column)
Remaining missing values in DEF_30_CNT_SOCIAL_CIRCLE: 0
Out[ ]:
0
In [ ]:
summary_stats = check_distribution_p(appl_train, column)
{'count': 307511.0, 'mean': 0.14294448003486054, 'stddev': 0.4460325557437661, 'min': 0.0, 'max': 34.0}

5. Exploratory Data Analysis (for Application_Train file after cleaning)¶

There are 10 files in our library: 7 of them are data sources, and the remaining 3 are the train, test, and sample submission files. We'll check on the main file first (application_train.csv) and see what we can make out of it, then use the test set (application_test.csv) to make our submission.

In [ ]:
appl_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
In [ ]:
appl_train.shape
Out[ ]:
(307511, 122)
  • The training file has a total of 307511 observations and 122 features -- with integer, float, and object datatypes, and 67 features containing null values.

  • The test file is similar: it has 48744 observations, 121 features (the same set minus the target variable 'TARGET'), and 64 of them contain null values.

Note on the words being used:

  • observations == samples == rows
  • features == fields == columns
  • defaulted == failed to meet the legal obligations of the loan
In [ ]:
# Print first 5 rows of the training file
appl_train.head()
Out[ ]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 122 columns

Since our model is to classify the 'target' column, we can plot it to visualize the current sample distribution.

In [ ]:
# Show target distribution
sns.set_style('darkgrid')
print(appl_train.TARGET.value_counts())
appl_train.TARGET.plot.hist(color='mediumseagreen').set_xlabel('Target value: 0 or 1');
TARGET
0    282686
1     24825
Name: count, dtype: int64
No description has been provided for this image

Based on the above information, the dataset is imbalanced: only around 8% of the loans in the training set were not repaid.

JUST TO BE CLEAR, OUR INDICATOR OF WHETHER A CURRENT/PREVIOUS CUSTOMER DEFAULTED ON A LOAN IS THE 'TARGET' COLUMN:

TARGET == 0 --> individuals who paid their loan

TARGET == 1 --> individuals who did NOT repay their loan
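The imbalance can be quantified directly from the value counts above: since TARGET is 0/1, the default rate is simply its mean.

```python
import pandas as pd

# Rebuild the TARGET counts reported above (282686 zeros, 24825 ones).
target = pd.Series([0] * 282686 + [1] * 24825)

default_rate = target.mean()  # about 0.0807, i.e. roughly 8% defaulted
```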

Now, let's take a closer look at the fields.

Analyzing Data Insights (EDA)¶

We'll be taking note of the things below during our field inspection:

  1. Completing - any missing values to fill in?
  2. Correlation - which features contribute significantly to our solution goal?
  3. Correcting - any outlier that skews our data terribly? Do we have some unusual error seen on the dataset? We may need to correct or exclude inaccurate features.
  4. Conversion - most of the time, the text features need to be converted to numerical values for our model training.
  5. Creating - maybe we can create a new feature out of the existing set of highly correlated features?
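A minimal sketch of these five checks on a hypothetical miniature frame (the column names mirror the dataset, but the values are made up for illustration):

```python
import pandas as pd

# Tiny stand-in for application_train.csv.
df = pd.DataFrame({
    "TARGET": [0, 1, 0, 1],
    "AMT_CREDIT": [100.0, None, 250.0, 180.0],
    "AMT_INCOME_TOTAL": [50.0, 40.0, 100.0, 90.0],
    "CODE_GENDER": ["M", "F", "F", "M"],
})

missing = df.isnull().sum()                              # 1. Completing
corr = df["AMT_CREDIT"].corr(df["TARGET"])               # 2. Correlation
outliers = df["AMT_CREDIT"].describe()[["min", "max"]]   # 3. Correcting
encoded = pd.get_dummies(df, columns=["CODE_GENDER"], drop_first=True)  # 4. Conversion
df["PERCENT_CREDIT_INCOME"] = df["AMT_CREDIT"] / df["AMT_INCOME_TOTAL"] # 5. Creating
```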

Before anything else, it's important to understand why these fields are recorded in the first place. Below is my understanding of how we might judge whether an applicant is a good candidate for a loan. home_credit.jpg

(I'm not claiming that this is right, but for me, this makes sense.)

During field inspection, I'll try to determine which of these questions each field addresses; this can serve as a guide on how we deal with each feature. Below are Excel sheet snippets of the initial sorting I made of all the fields (please see the TAG column).

Decisions:

  • 'SK_ID_CURR' is just the loan ID which is unique for every individual. It will not contribute to the prediction algorithm. --> DROP
  • The 'TARGET' variable is the one that we are trying to predict, so we will NOT use it as a feature.
In [ ]:
# COLUMNS TO BE DROPPED
list_col_drop = ['SK_ID_CURR']
In [ ]:
tempY=appl_train[appl_train.FLAG_OWN_CAR=='Y']
tempN=appl_train[appl_train.FLAG_OWN_CAR=='N']
tempY_targ1=tempY[tempY.TARGET==1]
tempN_targ1=tempN[tempN.TARGET==1]
In [ ]:
print('Number of people who own a car:',len(tempY),'(',round(len(tempY)/len(appl_train.index)*100,2),'%)')
print('Number of people who do NOT own a car:',len(tempN),'(',round(len(tempN)/len(appl_train.index)*100,2),'%)')
print('Percentage of people who defaulted (with cars):',round(len(tempY_targ1)/len(tempY)*100,2),'%')
print('Percentage of people who defaulted (no cars):',round(len(tempN_targ1)/len(tempN)*100,2),'%')
Number of people who own a car: 104587 ( 34.01 %)
Number of people who do NOT own a car: 202924 ( 65.99 %)
Percentage of people who defaulted (with cars): 7.24 %
Percentage of people who defaulted (no cars): 8.5 %
In [ ]:
tempY=appl_train[appl_train.FLAG_OWN_REALTY=='Y']
tempN=appl_train[appl_train.FLAG_OWN_REALTY=='N']
tempY_targ1=tempY[tempY.TARGET==1]
tempN_targ1=tempN[tempN.TARGET==1]

Observations:

  • It's a small difference, but people with no car and/or realty appear to default slightly more often than those who own one.

Decisions:

  • I'll be converting these 2 categorical fields into one nominal field called 'assets'.
In [ ]:
# CREATE NEW COLUMN : 0 - none, 1 - with car no realty, 2 - no car with realty, 3 - with car with realty
list_col_new_asset = ['FLAG_OWN_CAR','FLAG_OWN_REALTY']
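A sketch of how the planned 'assets' field could be derived, following the 0-3 encoding in the comment above (the derivation itself is an assumption; only the column names come from the dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "FLAG_OWN_CAR":    ["Y", "N", "N", "Y"],
    "FLAG_OWN_REALTY": ["N", "N", "Y", "Y"],
})

# 0 - none, 1 - with car no realty, 2 - no car with realty, 3 - both.
df["assets"] = (
    (df["FLAG_OWN_CAR"] == "Y").astype(int)
    + 2 * (df["FLAG_OWN_REALTY"] == "Y").astype(int)
)
```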

There are quite a few expenditure-related fields. We can break them down based on how they relate to one another.

Expenditure-related: THE LOAN ITSELF.

In [ ]:
sns.pairplot(appl_train[['NAME_CONTRACT_TYPE','AMT_CREDIT','AMT_ANNUITY','AMT_GOODS_PRICE']],hue='NAME_CONTRACT_TYPE');
No description has been provided for this image
In [ ]:
corr1=round(appl_train.AMT_CREDIT.corr(appl_train.AMT_GOODS_PRICE),2)
corr2=round(appl_train.AMT_ANNUITY.corr(appl_train.AMT_CREDIT),2)
corr3=round(appl_train.AMT_ANNUITY.corr(appl_train.AMT_GOODS_PRICE),2)
In [ ]:
print('Correlation of Credit amount vs Price of goods:',corr1)
print('Correlation of Annuity amount vs Credit amount:',corr2)
print('Correlation of Annuity amount vs Price of goods:',corr3)
Correlation of Credit amount vs Price of goods: 0.99
Correlation of Annuity amount vs Credit amount: 0.77
Correlation of Annuity amount vs Price of goods: 0.78
In [ ]:
cash=appl_train[appl_train.NAME_CONTRACT_TYPE == 'Cash loans']
rev=appl_train[appl_train.NAME_CONTRACT_TYPE == 'Revolving loans']
def_cash=cash[cash.TARGET==1]
def_rev=rev[rev.TARGET==1]
In [ ]:
print('Percentage of defaulted cash loan:',round(len(def_cash)/len(cash)*100,2,),'%')
print('Percentage of defaulted revolving loan:',round(len(def_rev)/len(rev)*100,2),'%')
sns.catplot(data=appl_train,x='NAME_CONTRACT_TYPE',hue='TARGET',kind='count');
Percentage of defaulted cash loan: 8.35 %
Percentage of defaulted revolving loan: 5.48 %
No description has been provided for this image

Observations:

  • AMT_CREDIT and AMT_GOODS_PRICE are highly correlated (0.99!) with a positive linear relationship -- which makes sense, because as the price of the goods for which the loan is given increases, the credit amount of the loan naturally increases too.
  • AMT_ANNUITY is also highly correlated with AMT_CREDIT and AMT_GOODS_PRICE, with positive linearity, because the annuity is the monthly due amount.
  • NAME_CONTRACT_TYPE: the count of defaulters is much higher for cash loans than for revolving loans; however, cash loans are also significantly more popular among our sample consumers.

Decisions:

  • AMT_GOODS_PRICE - to drop as it is quite redundant. We already have the AMT_CREDIT field to account for the loan amount.
  • AMT_ANNUITY - create new feature PERCENT_ANNUITY_INCOME to indicate the loan annuity amount relative to the person's total income.
  • AMT_CREDIT - create new feature PERCENT_CREDIT_INCOME to indicate the loan credit amount relative to the person's total income.
  • NAME_CONTRACT_TYPE - to drop, since the default rates for the two loan types differ by less than 3%. We can focus more on the person's annuity or credit amount relative to their income.
In [ ]:
# COLUMNS TO BE DROPPED
list_col_drop.extend(['NAME_CONTRACT_TYPE','AMT_GOODS_PRICE'])

# CREATE NEW COLUMN : PERCENT_ANNUITY_INCOME
list_col_new_annuity = ['AMT_ANNUITY','AMT_INCOME_TOTAL']

# CREATE NEW COLUMN : PERCENT_CREDIT_INCOME
list_col_new_credit = ['AMT_CREDIT','AMT_INCOME_TOTAL']
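The two ratio features proposed in the decisions above can be sketched directly. The income/annuity/credit values here are the first two rows from the earlier head() output:

```python
import pandas as pd

df = pd.DataFrame({
    "AMT_INCOME_TOTAL": [202500.0, 270000.0],
    "AMT_ANNUITY": [24700.5, 35698.5],
    "AMT_CREDIT": [406597.5, 1293502.5],
})

# Annuity and credit amounts relative to the person's total income.
df["PERCENT_ANNUITY_INCOME"] = df["AMT_ANNUITY"] / df["AMT_INCOME_TOTAL"]
df["PERCENT_CREDIT_INCOME"] = df["AMT_CREDIT"] / df["AMT_INCOME_TOTAL"]
```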

Expenditure-related: SIZE OF HOUSEHOLD

In [ ]:
sns.catplot(data=appl_train,x='NAME_FAMILY_STATUS',hue='TARGET',kind='count');
plt.xticks(rotation=90);
No description has been provided for this image
In [ ]:
appl_train[['CNT_CHILDREN','NAME_FAMILY_STATUS','CNT_FAM_MEMBERS']][appl_train.NAME_FAMILY_STATUS=='Married'].tail()
Out[ ]:
CNT_CHILDREN NAME_FAMILY_STATUS CNT_FAM_MEMBERS
307500 0 Married 2.0
307501 3 Married 5.0
307502 1 Married 3.0
307509 0 Married 2.0
307510 0 Married 2.0

Observations:

  • We have a large number of married customers in our sample population, and the married group also contains the most defaulters.
  • 'CNT_CHILDREN', 'NAME_FAMILY_STATUS', and 'CNT_FAM_MEMBERS' are all related to family size. If we only need family size to approximate expenditures, 'CNT_FAM_MEMBERS' already covers it: the customer, plus spouse (if any), plus children (if any).

Decisions:

  • Use 'CNT_FAM_MEMBERS' as a model feature and fill in its missing values.
  • Drop 'CNT_CHILDREN' and 'NAME_FAMILY_STATUS'.
In [ ]:
# COLUMNS TO BE DROPPED
list_col_drop.extend(['CNT_CHILDREN', 'NAME_FAMILY_STATUS'])

# COMPLETE COLUMN :
list_col_fill_fam = ['CNT_FAM_MEMBERS']

Expenditure-related: TRAVEL

In [ ]:
corr1=round(appl_train.OWN_CAR_AGE.corr(appl_train.TARGET),2)
corr2=round(appl_train.REG_REGION_NOT_WORK_REGION.corr(appl_train.TARGET),2)
corr3=round(appl_train.REG_CITY_NOT_WORK_CITY.corr(appl_train.TARGET),2)
no_car,yes_car = appl_train.FLAG_OWN_CAR.value_counts()
In [ ]:
print('Correlation of Age of Car vs Target:',corr1)
print('Correlation of Registered Region != Work Region vs Target:',corr2)
print('Correlation of Registered City != Work City vs Target:',corr3)
Correlation of Age of Car vs Target: 0.04
Correlation of Registered Region != Work Region vs Target: 0.01
Correlation of Registered City != Work City vs Target: 0.05
In [ ]:
appl_train[['OWN_CAR_AGE','REG_REGION_NOT_WORK_REGION','REG_CITY_NOT_WORK_CITY']].describe()
Out[ ]:
OWN_CAR_AGE REG_REGION_NOT_WORK_REGION REG_CITY_NOT_WORK_CITY
count 104582.000000 307511.000000 307511.000000
mean 12.061091 0.050769 0.230454
std 11.944812 0.219526 0.421124
min 0.000000 0.000000 0.000000
25% 5.000000 0.000000 0.000000
50% 9.000000 0.000000 0.000000
75% 15.000000 0.000000 0.000000
max 91.000000 1.000000 1.000000
In [ ]:
print('How many customers own a car? :',yes_car)
print('How many customers do NOT own a car? :',no_car)
print('How many missing values on OWN_CAR_AGE? :',appl_train.OWN_CAR_AGE.isnull().sum())
How many customers own a car? : 104587
How many customers do NOT own a car? : 202924
How many missing values on OWN_CAR_AGE? : 202929

Observations:

  • These travel-related fields have low correlation with the target variable.
  • The number of missing values in 'OWN_CAR_AGE' is almost equal to the number of customers who do NOT own a car -- which explains why the field has missing values.

Decisions:

  • Drop REG_REGION_NOT_WORK_REGION and REG_CITY_NOT_WORK_CITY, as they have low correlation with the TARGET variable.
  • Create a new ordinal category 'Car Expenditure' from the continuous numerical feature 'OWN_CAR_AGE'. Perhaps the older the car, the higher the maintenance expenditure.
In [ ]:
# CREATE NEW COLUMN : EXPENDITURE_CAR : 0 - no car, +=1 per age band
list_col_new_car = ['OWN_CAR_AGE']

# COLUMNS TO BE DROPPED
list_col_drop.extend(['REG_REGION_NOT_WORK_REGION','REG_CITY_NOT_WORK_CITY'])
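One possible implementation of the EXPENDITURE_CAR idea uses pd.cut. The band edges here (5/10/20/100 years) are illustrative assumptions, not the project's final choice; NaN in OWN_CAR_AGE means the customer has no car and maps to 0:

```python
import numpy as np
import pandas as pd

# NaN means the customer owns no car; each age band adds one step.
own_car_age = pd.Series([np.nan, 2.0, 7.0, 15.0, 40.0])

bands = pd.cut(own_car_age, bins=[-1, 5, 10, 20, 100], labels=[1, 2, 3, 4])
expenditure_car = bands.cat.codes + 1  # NaN has code -1, so "no car" maps to 0
```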

Expenditure-related: RESIDENCE

In [ ]:
temp = ['NAME_HOUSING_TYPE',
        'APARTMENTS_AVG',
        'BASEMENTAREA_AVG',
        'YEARS_BEGINEXPLUATATION_AVG',
        'YEARS_BUILD_AVG',
        'COMMONAREA_AVG',
        'ELEVATORS_AVG',
        'ENTRANCES_AVG',
        'FLOORSMAX_AVG',
        'FLOORSMIN_AVG',
        'LANDAREA_AVG',
        'LIVINGAPARTMENTS_AVG',
        'LIVINGAREA_AVG',
        'NONLIVINGAPARTMENTS_AVG',
        'NONLIVINGAREA_AVG',
        'APARTMENTS_MODE',
        'BASEMENTAREA_MODE',
        'YEARS_BEGINEXPLUATATION_MODE',
        'YEARS_BUILD_MODE',
        'COMMONAREA_MODE',
        'ELEVATORS_MODE',
        'ENTRANCES_MODE',
        'FLOORSMAX_MODE',
        'FLOORSMIN_MODE',
        'LANDAREA_MODE',
        'LIVINGAPARTMENTS_MODE',
        'LIVINGAREA_MODE',
        'NONLIVINGAPARTMENTS_MODE',
        'NONLIVINGAREA_MODE',
        'APARTMENTS_MEDI',
        'BASEMENTAREA_MEDI',
        'YEARS_BEGINEXPLUATATION_MEDI',
        'YEARS_BUILD_MEDI',
        'COMMONAREA_MEDI',
        'ELEVATORS_MEDI',
        'ENTRANCES_MEDI',
        'FLOORSMAX_MEDI',
        'FLOORSMIN_MEDI',
        'LANDAREA_MEDI',
        'LIVINGAPARTMENTS_MEDI',
        'LIVINGAREA_MEDI',
        'NONLIVINGAPARTMENTS_MEDI',
        'NONLIVINGAREA_MEDI',
        'FONDKAPREMONT_MODE',
        'HOUSETYPE_MODE',
        'TOTALAREA_MODE',
        'WALLSMATERIAL_MODE',
        'EMERGENCYSTATE_MODE']
In [ ]:
# Nullity by column
msno.bar(appl_train[temp],figsize=(20,5));
No description has been provided for this image
In [ ]:
# Identify non-numeric columns
non_numeric_columns = appl_train.select_dtypes(include=['object']).columns
print("Categorical Columns:", non_numeric_columns)
Categorical Columns: Index(['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
       'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
       'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
       'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE',
       'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE'],
      dtype='object')
In [ ]:
# One-hot encode categorical columns

appl_train_eda = pd.get_dummies(appl_train, columns=non_numeric_columns, drop_first=True)  # drop_first can temporarily be set to False for visualization and hypothesis testing

# Calculate the correlation matrix
correlations = appl_train_eda.corr()

# Set a threshold for selecting strong correlations
threshold = 0.05

# Select features with correlations above the threshold with the target
best_features = correlations['TARGET'][abs(correlations['TARGET']) > threshold].index

# Create a subset of the correlation matrix with these features and the target
best_corr_matrix = correlations.loc[best_features, best_features]

# Plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(best_corr_matrix, annot=True, cmap='mako', vmin=-1, vmax=1, fmt=".2f", square=True)
plt.title("Heatmap of Best Correlated Features with Target Variable")
plt.show()
No description has been provided for this image

The heatmap above shows the pairwise Pearson correlations among the features whose correlation with TARGET exceeds the 0.05 threshold:

  • 1 : perfect positive correlation >> the two variables move together.
  • 0 : no linear correlation >> one variable tells us nothing about the other.
  • -1 : perfect negative correlation >> as one variable increases, the other decreases.
In [ ]:
appl_train.columns.to_list()
Out[ ]:
['SK_ID_CURR',
 'TARGET',
 'NAME_CONTRACT_TYPE',
 'CODE_GENDER',
 'FLAG_OWN_CAR',
 'FLAG_OWN_REALTY',
 'CNT_CHILDREN',
 'AMT_INCOME_TOTAL',
 'AMT_CREDIT',
 'AMT_ANNUITY',
 'AMT_GOODS_PRICE',
 'NAME_TYPE_SUITE',
 'NAME_INCOME_TYPE',
 'NAME_EDUCATION_TYPE',
 'NAME_FAMILY_STATUS',
 'NAME_HOUSING_TYPE',
 'REGION_POPULATION_RELATIVE',
 'DAYS_BIRTH',
 'DAYS_EMPLOYED',
 'DAYS_REGISTRATION',
 'DAYS_ID_PUBLISH',
 'OWN_CAR_AGE',
 'FLAG_MOBIL',
 'FLAG_EMP_PHONE',
 'FLAG_WORK_PHONE',
 'FLAG_CONT_MOBILE',
 'FLAG_PHONE',
 'FLAG_EMAIL',
 'OCCUPATION_TYPE',
 'CNT_FAM_MEMBERS',
 'REGION_RATING_CLIENT',
 'REGION_RATING_CLIENT_W_CITY',
 'WEEKDAY_APPR_PROCESS_START',
 'HOUR_APPR_PROCESS_START',
 'REG_REGION_NOT_LIVE_REGION',
 'REG_REGION_NOT_WORK_REGION',
 'LIVE_REGION_NOT_WORK_REGION',
 'REG_CITY_NOT_LIVE_CITY',
 'REG_CITY_NOT_WORK_CITY',
 'LIVE_CITY_NOT_WORK_CITY',
 'ORGANIZATION_TYPE',
 'EXT_SOURCE_1',
 'EXT_SOURCE_2',
 'EXT_SOURCE_3',
 'APARTMENTS_AVG',
 'BASEMENTAREA_AVG',
 'YEARS_BEGINEXPLUATATION_AVG',
 'YEARS_BUILD_AVG',
 'COMMONAREA_AVG',
 'ELEVATORS_AVG',
 'ENTRANCES_AVG',
 'FLOORSMAX_AVG',
 'FLOORSMIN_AVG',
 'LANDAREA_AVG',
 'LIVINGAPARTMENTS_AVG',
 'LIVINGAREA_AVG',
 'NONLIVINGAPARTMENTS_AVG',
 'NONLIVINGAREA_AVG',
 'APARTMENTS_MODE',
 'BASEMENTAREA_MODE',
 'YEARS_BEGINEXPLUATATION_MODE',
 'YEARS_BUILD_MODE',
 'COMMONAREA_MODE',
 'ELEVATORS_MODE',
 'ENTRANCES_MODE',
 'FLOORSMAX_MODE',
 'FLOORSMIN_MODE',
 'LANDAREA_MODE',
 'LIVINGAPARTMENTS_MODE',
 'LIVINGAREA_MODE',
 'NONLIVINGAPARTMENTS_MODE',
 'NONLIVINGAREA_MODE',
 'APARTMENTS_MEDI',
 'BASEMENTAREA_MEDI',
 'YEARS_BEGINEXPLUATATION_MEDI',
 'YEARS_BUILD_MEDI',
 'COMMONAREA_MEDI',
 'ELEVATORS_MEDI',
 'ENTRANCES_MEDI',
 'FLOORSMAX_MEDI',
 'FLOORSMIN_MEDI',
 'LANDAREA_MEDI',
 'LIVINGAPARTMENTS_MEDI',
 'LIVINGAREA_MEDI',
 'NONLIVINGAPARTMENTS_MEDI',
 'NONLIVINGAREA_MEDI',
 'FONDKAPREMONT_MODE',
 'HOUSETYPE_MODE',
 'TOTALAREA_MODE',
 'WALLSMATERIAL_MODE',
 'EMERGENCYSTATE_MODE',
 'OBS_30_CNT_SOCIAL_CIRCLE',
 'DEF_30_CNT_SOCIAL_CIRCLE',
 'OBS_60_CNT_SOCIAL_CIRCLE',
 'DEF_60_CNT_SOCIAL_CIRCLE',
 'DAYS_LAST_PHONE_CHANGE',
 'FLAG_DOCUMENT_2',
 'FLAG_DOCUMENT_3',
 'FLAG_DOCUMENT_4',
 'FLAG_DOCUMENT_5',
 'FLAG_DOCUMENT_6',
 'FLAG_DOCUMENT_7',
 'FLAG_DOCUMENT_8',
 'FLAG_DOCUMENT_9',
 'FLAG_DOCUMENT_10',
 'FLAG_DOCUMENT_11',
 'FLAG_DOCUMENT_12',
 'FLAG_DOCUMENT_13',
 'FLAG_DOCUMENT_14',
 'FLAG_DOCUMENT_15',
 'FLAG_DOCUMENT_16',
 'FLAG_DOCUMENT_17',
 'FLAG_DOCUMENT_18',
 'FLAG_DOCUMENT_19',
 'FLAG_DOCUMENT_20',
 'FLAG_DOCUMENT_21',
 'AMT_REQ_CREDIT_BUREAU_HOUR',
 'AMT_REQ_CREDIT_BUREAU_DAY',
 'AMT_REQ_CREDIT_BUREAU_WEEK',
 'AMT_REQ_CREDIT_BUREAU_MON',
 'AMT_REQ_CREDIT_BUREAU_QRT',
 'AMT_REQ_CREDIT_BUREAU_YEAR']
In [ ]:
for i in ['APARTMENTS_AVG','LANDAREA_AVG','LIVINGAPARTMENTS_AVG','NONLIVINGAREA_MEDI']:
    temp = appl_train[['NAME_HOUSING_TYPE']][appl_train[i].isnull()]
    # Assign x to hue with legend=False to avoid the seaborn palette deprecation warning
    sns.catplot(data=temp, x='NAME_HOUSING_TYPE', hue='NAME_HOUSING_TYPE',
                kind='count', palette="rocket", legend=False)
    plt.xticks(rotation=20)
    plt.title('Housing Types with null values on ' + i)
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Observations:

  • Every unit/size-related field from 'APARTMENTS_AVG' through 'WALLSMATERIAL_MODE' is highly correlated with the others, with no pairwise correlation below 0.7.
  • We plotted 4 sample features of this normalized residence information, and the missing values come mostly from the 'House / apartment' type -- note that 'Rented apartment' is a separate category with a much lower count of missing values. This is understandable: someone who has lived in a house for a long time usually does not track its floor size, or has forgotten it (unless they plan to renovate, rent out, or sell the place!).
  • The housing type with the second-highest null count is 'With parents', and again it is common that someone living with their parents will not know the land and floor area of their current residence.

Decisions:

  • Since these are all residence-related size measures, we will consolidate them into one field to account for 'house expenditures'.
In [ ]:
# CREATE NEW COLUMN : EXPENDITURE_HOUSE
list_col_new_house = ['NAME_HOUSING_TYPE',
                    'APARTMENTS_AVG',
                    'BASEMENTAREA_AVG',
                    'YEARS_BEGINEXPLUATATION_AVG',
                    'YEARS_BUILD_AVG',
                    'COMMONAREA_AVG',
                    'ELEVATORS_AVG',
                    'ENTRANCES_AVG',
                    'FLOORSMAX_AVG',
                    'FLOORSMIN_AVG',
                    'LANDAREA_AVG',
                    'LIVINGAPARTMENTS_AVG',
                    'LIVINGAREA_AVG',
                    'NONLIVINGAPARTMENTS_AVG',
                    'NONLIVINGAREA_AVG',
                    'APARTMENTS_MODE',
                    'BASEMENTAREA_MODE',
                    'YEARS_BEGINEXPLUATATION_MODE',
                    'YEARS_BUILD_MODE',
                    'COMMONAREA_MODE',
                    'ELEVATORS_MODE',
                    'ENTRANCES_MODE',
                    'FLOORSMAX_MODE',
                    'FLOORSMIN_MODE',
                    'LANDAREA_MODE',
                    'LIVINGAPARTMENTS_MODE',
                    'LIVINGAREA_MODE',
                    'NONLIVINGAPARTMENTS_MODE',
                    'NONLIVINGAREA_MODE',
                    'APARTMENTS_MEDI',
                    'BASEMENTAREA_MEDI',
                    'YEARS_BEGINEXPLUATATION_MEDI',
                    'YEARS_BUILD_MEDI',
                    'COMMONAREA_MEDI',
                    'ELEVATORS_MEDI',
                    'ENTRANCES_MEDI',
                    'FLOORSMAX_MEDI',
                    'FLOORSMIN_MEDI',
                    'LANDAREA_MEDI',
                    'LIVINGAPARTMENTS_MEDI',
                    'LIVINGAREA_MEDI',
                    'NONLIVINGAPARTMENTS_MEDI',
                    'NONLIVINGAREA_MEDI',
                    'FONDKAPREMONT_MODE',
                    'HOUSETYPE_MODE',
                    'TOTALAREA_MODE',
                    'WALLSMATERIAL_MODE',
                    'EMERGENCYSTATE_MODE']
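As a minimal sketch of the consolidation decided above -- on a toy frame standing in for appl_train, using a row-wise mean of the numeric housing fields (the aggregation choice and the name EXPENDITURE_HOUSE are our assumptions, not fixed by the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for appl_train with three of the normalized building-stat columns.
house_cols = ['APARTMENTS_AVG', 'LIVINGAREA_AVG', 'LANDAREA_MODE']
toy = pd.DataFrame({
    'APARTMENTS_AVG': [0.20, np.nan, 0.50],
    'LIVINGAREA_AVG': [0.30, np.nan, 0.70],
    'LANDAREA_MODE':  [0.10, np.nan, 0.60],
})

# Row-wise mean across the highly correlated housing fields; rows with no
# housing information at all stay NaN and are filled with 0.
toy['EXPENDITURE_HOUSE'] = toy[house_cols].mean(axis=1).fillna(0)
print(toy['EXPENDITURE_HOUSE'].round(2).tolist())  # [0.2, 0.0, 0.6]
```

Averaging (rather than summing) keeps the consolidated field on the same normalized 0-1 scale as its inputs.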
In [ ]:
fig,ax = plt.subplots(figsize=(16,8))
sns.countplot(data=appl_train,x='REGION_POPULATION_RELATIVE',hue='TARGET',ax=ax);
plt.xticks(rotation=90)
plt.show()
No description has been provided for this image
In [ ]:
sns.catplot(data=appl_train,x='WEEKDAY_APPR_PROCESS_START',hue='TARGET',kind='count');
plt.xticks(rotation=90);
No description has been provided for this image
In [ ]:
fig,ax = plt.subplots(figsize=(10,3))
sns.countplot(data=appl_train,x='HOUR_APPR_PROCESS_START',hue='TARGET',ax=ax);
plt.xticks(rotation=90)
plt.show()
No description has been provided for this image

Observations:

  • REGION_POPULATION_RELATIVE ranges from a minimum of 0.00029 to a maximum of 0.072508. Normalized values usually span 0 to 1, but here the maximum falls far short of 1. Whatever the scale represents, the number of defaults noticeably rises as the value increases.
  • The majority of customers apply on weekdays, with a few on weekends. The pattern for customers who failed to repay mirrors that of those who repaid.
  • Suspiciously, some people apply for a loan as early as 3 am, and applications grow denser through the day. Note that those who defaulted show a pattern similar to customers with good records.

Decisions:

  • Use REGION_POPULATION_RELATIVE as a model feature. The larger the relative population of a region, the greater the chance of encountering a customer with a bad record.
  • Drop WEEKDAY_APPR_PROCESS_START. We cannot conclude that Sunday applicants default less, because overall application volume on Sundays is low compared to other days.
  • Drop HOUR_APPR_PROCESS_START, for the same reason.
In [ ]:
# COLUMNS TO BE DROPPED
list_col_drop.extend(['WEEKDAY_APPR_PROCESS_START','HOUR_APPR_PROCESS_START'])
In [ ]:
temp=appl_train[['TARGET','EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3']]
print(temp.info())
sns.pairplot(temp,hue='TARGET');
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   TARGET        307511 non-null  int64  
 1   EXT_SOURCE_1  134133 non-null  float64
 2   EXT_SOURCE_2  306851 non-null  float64
 3   EXT_SOURCE_3  246546 non-null  float64
dtypes: float64(3), int64(1)
memory usage: 9.4 MB
None
No description has been provided for this image

Observations:

  • These 3 fields are scores from external data sources.
  • Based on the pairplot, both repaid and defaulted customers span the full score range on the EXT_SOURCE fields, but on the lower end of the normalized scale (0.0-0.5) customers who paid (target=0, blue) are much less prominent than those who didn't (target=1, orange)... and vice versa on the upper end.
  • All 3 fields have missing values.

Decisions:

  • Use EXT_SOURCE_1, EXT_SOURCE_2, EXT_SOURCE_3 as model features.
  • Complete the fields by filling the null values with each column's mean EXT_SOURCE value.
In [ ]:
# COMPLETE COLUMNS : fill based on mean EXT_SOURCE
list_col_fill_ext = ['EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3']
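A minimal sketch of the mean imputation decided above, on a toy frame standing in for appl_train:

```python
import numpy as np
import pandas as pd

list_col_fill_ext = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']

# Toy EXT_SOURCE columns with gaps in different rows.
toy = pd.DataFrame({
    'EXT_SOURCE_1': [0.1, np.nan, 0.5],
    'EXT_SOURCE_2': [0.4, 0.6, np.nan],
    'EXT_SOURCE_3': [np.nan, 0.2, 0.8],
})

# fillna with a Series of column means fills each column with its own mean,
# computed over the non-null rows only.
toy[list_col_fill_ext] = toy[list_col_fill_ext].fillna(toy[list_col_fill_ext].mean())
print(toy.round(2))
```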
In [ ]:
appl_train[['AMT_INCOME_TOTAL','NAME_INCOME_TYPE','DAYS_EMPLOYED','OCCUPATION_TYPE','ORGANIZATION_TYPE']].describe(include='all')
Out[ ]:
AMT_INCOME_TOTAL NAME_INCOME_TYPE DAYS_EMPLOYED OCCUPATION_TYPE ORGANIZATION_TYPE
count 3.075110e+05 307511 307511.000000 211120 307511
unique NaN 8 NaN 18 58
top NaN Working NaN Laborers Business Entity Type 3
freq NaN 158774 NaN 55186 67992
mean 1.687979e+05 NaN 63815.045904 NaN NaN
std 2.371231e+05 NaN 141275.766519 NaN NaN
min 2.565000e+04 NaN -17912.000000 NaN NaN
25% 1.125000e+05 NaN -2760.000000 NaN NaN
50% 1.471500e+05 NaN -1213.000000 NaN NaN
75% 2.025000e+05 NaN -289.000000 NaN NaN
max 1.170000e+08 NaN 365243.000000 NaN NaN
In [ ]:
# Function computing an outlier threshold (mean + 3 standard deviations) for a field
def get_thresh(df,field):
    """ Outliers are usually > 3 standard deviations away from the mean. """
    ave=np.mean(df[field])
    sdev=np.std(df[field])
    threshold=round(ave+(3*sdev),2)
    print('Threshold for',field,':',threshold)
    return threshold
In [ ]:
thresh_income = get_thresh(appl_train,'AMT_INCOME_TOTAL')
anomaly_emp = int(appl_train.loc[appl_train['DAYS_EMPLOYED'] > 0, 'DAYS_EMPLOYED'].unique()[0])  # extract the single element to avoid the NumPy scalar-conversion deprecation
temp_orig=appl_train[['AMT_INCOME_TOTAL','DAYS_EMPLOYED']]
temp_no_outliers=appl_train[['AMT_INCOME_TOTAL','DAYS_EMPLOYED']][(appl_train.AMT_INCOME_TOTAL<thresh_income)&(appl_train['DAYS_EMPLOYED']<=0)]
print('Anomalous data for DAYS_EMPLOYED :',anomaly_emp)
Threshold for AMT_INCOME_TOTAL : 880166.2
Anomalous data for DAYS_EMPLOYED : 365243
In [ ]:
def plotdist(df, f1, f2):
    # histplot replaces distplot, which is deprecated and removed in seaborn v0.14
    f, axes = plt.subplots(1, 2, figsize=(10, 3))
    sns.histplot(df[f1], kde=True, ax=axes[0]).set_title(f1)
    axes[0].tick_params(axis='x', rotation=75)

    sns.histplot(df[f2], kde=True, ax=axes[1]).set_title(f2)
    axes[1].tick_params(axis='x', rotation=75)
    plt.tight_layout()
In [ ]:
# AMT_INCOME_TOTAL, DAYS_EMPLOYED --> ORIGINAL VALUE WITH OUTLIERS
plotdist(temp_orig,'AMT_INCOME_TOTAL','DAYS_EMPLOYED')
No description has been provided for this image
In [ ]:
# AMT_INCOME_TOTAL, DAYS_EMPLOYED --> OUTLIERS REMOVED
plotdist(temp_no_outliers,'AMT_INCOME_TOTAL','DAYS_EMPLOYED')
No description has been provided for this image
In [ ]:
# NUMBER OF INDIVIDUALS HAVING THE DAYS EMPLOYED ANOMALOUS DATA
len(appl_train[appl_train.DAYS_EMPLOYED==anomaly_emp])
Out[ ]:
55374
In [ ]:
unpaid=appl_train[appl_train.TARGET==1]
sns.catplot(data=unpaid,x='NAME_INCOME_TYPE',kind='count');
plt.xticks(rotation=90);
plt.title('Income Stream of DEFAULTED LOANS');
No description has been provided for this image
In [ ]:
fig,ax = plt.subplots(figsize=(15,5))
sns.countplot(data=unpaid,x='ORGANIZATION_TYPE',ax=ax);
plt.xticks(rotation=90)
plt.title('Organization of people with DEFAULTED LOANS')
plt.show()
No description has been provided for this image
In [ ]:
def plotstrip(df,xval,yval,hueval,yfig):
    fig,ax = plt.subplots(figsize=(15,yfig))
    sns.stripplot(x=xval,y=yval,hue=hueval,data=df,alpha=0.5,jitter=0.8,dodge=True,ax=ax).set_title(yval);
    plt.legend(bbox_to_anchor=(1.05, 1))
    plt.show()
In [ ]:
temp=appl_train[appl_train.OCCUPATION_TYPE.isnull()]
print('Individuals that left OCCUPATION_TYPE field blank:')
print(temp.NAME_INCOME_TYPE.value_counts())
Individuals that left OCCUPATION_TYPE field blank:
NAME_INCOME_TYPE
Pensioner               55357
Working                 24920
Commercial associate    12297
State servant            3787
Unemployed                 22
Student                     5
Businessman                 2
Maternity leave             1
Name: count, dtype: int64
In [ ]:
plotstrip(appl_train[appl_train.AMT_INCOME_TOTAL<thresh_income],'AMT_INCOME_TOTAL','NAME_INCOME_TYPE','TARGET',5)
No description has been provided for this image
In [ ]:
plotstrip(appl_train[appl_train.AMT_INCOME_TOTAL<thresh_income],'AMT_INCOME_TOTAL','ORGANIZATION_TYPE','TARGET',20)
No description has been provided for this image

Observations:

  • AMT_INCOME_TOTAL has outliers. As per the data, the 75th percentile is 202,500, while the maximum is a whopping 117,000,000. In other words, 75% of our customers earn a total income of 202,500 or below, and the remaining 25% earn more.
  • AMT_INCOME_TOTAL spans a wide range of values, from a minimum of 25,650 to a maximum of 117,000,000.
  • DAYS_EMPLOYED contains inconsistent data. Most values are negative (days relative to the application date), but one very large positive value (365243 days -- roughly 1,000 years!) appears for a significant number of individuals.
  • NAME_INCOME_TYPE: the 'Working' category is the densest in terms of low-wage, high-default customers. We also have very few samples for 'Unemployed', 'Student', 'Maternity leave', and 'Businessman'. Interestingly, the 'Businessman' category has an above-average total income and a high chance of maintaining a good credit score.
  • The NAME_INCOME_TYPE 'Pensioner' count is almost equal to the DAYS_EMPLOYED anomaly count, and also almost equal to the count of ORGANIZATION_TYPE value 'XNA'! This suggests the '1,000 years' employment duration was assigned to retired people.
  • OCCUPATION_TYPE is missing 31% of its values. The majority of the null OCCUPATION_TYPE rows come from pensioners (which makes sense), with other large chunks from working people, commercial associates, state servants, etc. Our take is that the limited set of occupation choices led many working people to leave the field blank.
  • ORGANIZATION_TYPE is quite diverse in terms of where customers work. Based on the histogram, defaulting individuals are dominant in Business Entity Type 3, Self-employed, and XNA.

Decisions:

  • AMT_INCOME_TOTAL: create a new feature 'IncomeBand' converting income into an ordinal income band category.
  • DAYS_EMPLOYED: convert the anomalous value 365243 days to -29200 days (about 80 years); it is safe to say an individual has retired after working for 80 years. Also create a new feature 'years employed' converting the employed 'days' to 'years' for easier reading.
  • OCCUPATION_TYPE: drop. We can instead leverage how much total income customers have and how they acquire it.
  • NAME_INCOME_TYPE: drop. We can usually infer whether someone is working, a student, or retired from age and ORGANIZATION_TYPE.
  • ORGANIZATION_TYPE: convert categorical text to numerical. Use the feature for modeling.
In [ ]:
# COLUMNS TO BE DROPPED
list_col_drop.extend(['OCCUPATION_TYPE','NAME_INCOME_TYPE'])

# CREATE NEW COLUMN : INCOME_BAND
list_col_new_income = ['AMT_INCOME_TOTAL']

# CONVERT COLUMN : 365243 to -29200
list_col_conv_daysemp = ['DAYS_EMPLOYED']

# CREATE NEW COLUMN : YEARS_EMPLOYED
list_col_new_yrsemp = ['DAYS_EMPLOYED']

# CONVERT COLUMN :
list_col_conv_org = ['ORGANIZATION_TYPE']
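The four income/employment decisions above can be sketched on a toy frame standing in for appl_train (the quartile-based banding via pd.qcut and the category-code encoding are our illustrative choices; only the 365243 → -29200 conversion is fixed by the decision itself):

```python
import pandas as pd

toy = pd.DataFrame({
    'AMT_INCOME_TOTAL':  [50_000, 120_000, 200_000, 900_000],
    'DAYS_EMPLOYED':     [-1000, 365243, -5000, -200],
    'ORGANIZATION_TYPE': ['School', 'XNA', 'Self-employed', 'School'],
})

# 1) Ordinal income band (here 4 quantile-based bands; the cut points are a choice).
toy['INCOME_BAND'] = pd.qcut(toy['AMT_INCOME_TOTAL'], q=4, labels=False)

# 2) Replace the 365243-day sentinel with -29200 days (~80 years employed).
toy['DAYS_EMPLOYED'] = toy['DAYS_EMPLOYED'].replace(365243, -29200)

# 3) Employment duration in positive years, easier to read than day counts.
toy['YEARS_EMPLOYED'] = -toy['DAYS_EMPLOYED'] / 365

# 4) Encode ORGANIZATION_TYPE text as integer category codes.
toy['ORGANIZATION_TYPE'] = toy['ORGANIZATION_TYPE'].astype('category').cat.codes
```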

For the final set of variables, we use the tag 'trustworthiness' to account for trustworthy credentials and the environmental or unconscious factors that may shape an individual's behavior. It covers quite a lot of fields, so let's break it down again by function. (Note: the spreadsheet above is incomplete, as this notebook keeps crashing whenever the remaining fields are pasted.)

In [ ]:
appl_train[['CODE_GENDER','NAME_TYPE_SUITE','NAME_EDUCATION_TYPE','DAYS_REGISTRATION','DAYS_ID_PUBLISH']].describe(include='all')
Out[ ]:
CODE_GENDER NAME_TYPE_SUITE NAME_EDUCATION_TYPE DAYS_REGISTRATION DAYS_ID_PUBLISH
count 307511 306219 307511 307511.000000 307511.000000
unique 3 7 5 NaN NaN
top F Unaccompanied Secondary / secondary special NaN NaN
freq 202448 248526 218391 NaN NaN
mean NaN NaN NaN -4986.120328 -2994.202373
std NaN NaN NaN 3522.886321 1509.450419
min NaN NaN NaN -24672.000000 -7197.000000
25% NaN NaN NaN -7479.500000 -4299.000000
50% NaN NaN NaN -4504.000000 -3254.000000
75% NaN NaN NaN -2010.000000 -1720.000000
max NaN NaN NaN 0.000000 0.000000
In [ ]:
g = sns.FacetGrid(appl_train,row='CODE_GENDER',col='NAME_EDUCATION_TYPE',hue='TARGET',height=4) # nominal
g.map(plt.scatter,'DAYS_ID_PUBLISH','DAYS_REGISTRATION',alpha=0.5,edgecolor='k',linewidth=0.5) # continuous

fig = g.fig
fig.set_size_inches(25,10)
fig.subplots_adjust(top=0.85,wspace=0.3)
fig.suptitle('Gender - Educational Attainment - Registration Change - ID Change - Credit Ranking',fontsize=20)

l = g.add_legend(title='Credit Score')
No description has been provided for this image
In [ ]:
paid = appl_train[appl_train.TARGET==0]
unpaid = appl_train[appl_train.TARGET==1]

f,axes = plt.subplots(1,2,figsize=(10,3))
sns.kdeplot(x=paid['DAYS_ID_PUBLISH'], y=paid['DAYS_REGISTRATION'], cmap="Blues", fill=True, ax=axes[0]).set_title('Paid')  # fill replaces the deprecated shade
sns.kdeplot(x=unpaid['DAYS_ID_PUBLISH'], y=unpaid['DAYS_REGISTRATION'], cmap="Reds", fill=True, ax=axes[1]).set_title('Unpaid')
sns.set_style('whitegrid')
plt.tight_layout()
No description has been provided for this image
In [ ]:
appl_train.NAME_EDUCATION_TYPE.value_counts()
Out[ ]:
count
NAME_EDUCATION_TYPE
Secondary / secondary special 218391
Higher education 74863
Incomplete higher 10277
Lower secondary 3816
Academic degree 164

In [ ]:
f,axes = plt.subplots(1,2,figsize=(10,5),sharex=True)
# kdeplot with fill=True replaces the deprecated distplot/shade combination
sns.kdeplot(appl_train['DAYS_BIRTH'][appl_train.TARGET==0], fill=True, color="b", ax=axes[0]).set_title('Target == 0 (Paid)');
sns.kdeplot(appl_train['DAYS_BIRTH'][appl_train.TARGET==1], fill=True, color="r", ax=axes[1]).set_title('Target == 1 (Unpaid)');
No description has been provided for this image

Observations:

  • CODE_GENDER takes 3 values in our sample: male, female, and XNA. Male and female customers are both well represented, with similar default rates relative to their totals. The XNA group is very small, and in our sample 100% of them have a good credit record.
  • For NAME_EDUCATION_TYPE, the majority of our customers -- male, female, or XNA -- have 'Secondary / secondary special' or 'Higher education', and most defaulting customers come from these backgrounds. 'Incomplete higher' and 'Lower secondary' are smaller groups with fewer defaults, and 'Academic degree' is the smallest group with the fewest unpaid loans.
  • The number of days since the customer changed their identity document (DAYS_ID_PUBLISH) or application registration (DAYS_REGISTRATION) shows the same pattern for good and bad credit.
  • DAYS_BIRTH (effectively the customer's age) behaves as expected. Note that the x-axis values are negative -- recorded relative to the loan application -- and large, because they are in days rather than years. Per the plot, younger people tend to default more, while older people are more likely to pay off the loan.

Decisions:

  • Use CODE_GENDER as a model feature. Convert the categorical text to numeric.
  • Use NAME_EDUCATION_TYPE as a model feature. Convert the categorical text to numeric.
  • Drop DAYS_REGISTRATION, DAYS_ID_PUBLISH, and NAME_TYPE_SUITE. The KDE plots show hardly any difference between paid and unpaid.
  • Create a new feature 'age' from DAYS_BIRTH, computing the age in years rather than using the day count.
In [ ]:
# COLUMNS TO BE DROPPED
list_col_drop.extend(['NAME_TYPE_SUITE','DAYS_REGISTRATION','DAYS_ID_PUBLISH'])

# CONVERT COLUMN :
list_col_conv_gender = ['CODE_GENDER']

# CONVERT COLUMN :
list_col_conv_edu = ['NAME_EDUCATION_TYPE']

# CREATE NEW COLUMN : AGE
list_col_new_age = ['DAYS_BIRTH']
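A minimal sketch of the age conversion decided above, on a toy frame standing in for appl_train (negating the relative day count and floor-dividing by 365 gives age in whole years):

```python
import pandas as pd

# DAYS_BIRTH is stored as negative days relative to the application date.
toy = pd.DataFrame({'DAYS_BIRTH': [-9125, -14600, -21900]})

# Negate, then floor-divide by 365 to get age in whole years.
toy['AGE'] = (-toy['DAYS_BIRTH'] // 365).astype(int)
print(toy['AGE'].tolist())  # [25, 40, 60]
```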
In [ ]:
temp1=['FLAG_MOBIL',
'FLAG_EMP_PHONE',
'FLAG_WORK_PHONE',
'FLAG_CONT_MOBILE',
'FLAG_PHONE',
'FLAG_EMAIL']

temp2=['FLAG_DOCUMENT_2',
'FLAG_DOCUMENT_3',
'FLAG_DOCUMENT_4',
'FLAG_DOCUMENT_5',
'FLAG_DOCUMENT_6',
'FLAG_DOCUMENT_7',
'FLAG_DOCUMENT_8',
'FLAG_DOCUMENT_9',
'FLAG_DOCUMENT_10',
'FLAG_DOCUMENT_11',
'FLAG_DOCUMENT_12',
'FLAG_DOCUMENT_13',
'FLAG_DOCUMENT_14',
'FLAG_DOCUMENT_15',
'FLAG_DOCUMENT_16',
'FLAG_DOCUMENT_17',
'FLAG_DOCUMENT_18',
'FLAG_DOCUMENT_19',
'FLAG_DOCUMENT_20',
'FLAG_DOCUMENT_21']
In [ ]:
appl_train[temp1+temp2].describe()
Out[ ]:
FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 ... FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21
count 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 ... 307511.000000 307511.000000 307511.000000 307511.00000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000
mean 0.999997 0.819889 0.199368 0.998133 0.281066 0.056720 0.000042 0.710023 0.000081 0.015115 ... 0.000007 0.003525 0.002936 0.00121 0.009928 0.000267 0.008130 0.000595 0.000507 0.000335
std 0.001803 0.384280 0.399526 0.043164 0.449521 0.231307 0.006502 0.453752 0.009016 0.122010 ... 0.002550 0.059268 0.054110 0.03476 0.099144 0.016327 0.089798 0.024387 0.022518 0.018299
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 1.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 1.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 1.000000 1.000000 0.000000 1.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 26 columns

In [ ]:
# Sums binary flags per row across the given columns and plots the count of the new sum column against TARGET.
def featsum(cols,newcol):

    sample_count=appl_train[cols].sum(axis=1)
    sample = appl_train.copy()
    sample[newcol]=sample_count
    sns.catplot(data=sample,x=newcol,hue='TARGET',kind='count');
In [ ]:
featsum(temp1,'FlagContact')
No description has been provided for this image
In [ ]:
featsum(temp2,'FlagDocu')
No description has been provided for this image

Observations:

  • All 26 'FLAG-' (flag for contacts and documents) variables have nominal categorical values: 1='YES', 0='NO'.

Decisions:

  • Create a new feature 'FlagContact' counting the total number of contact flags recorded per individual.
  • Create a new feature 'FlagDocu' counting the total number of document flags recorded per individual.
In [ ]:
# CREATE NEW COLUMN : FLAG_CONTACTS
list_col_new_flagCont = [
    'FLAG_MOBIL',
    'FLAG_EMP_PHONE',
    'FLAG_WORK_PHONE',
    'FLAG_CONT_MOBILE',
    'FLAG_PHONE',
    'FLAG_EMAIL']

# CREATE NEW COLUMN : FLAG_DOCS
list_col_new_flagDoc = [
    'FLAG_DOCUMENT_2',
    'FLAG_DOCUMENT_3',
    'FLAG_DOCUMENT_4',
    'FLAG_DOCUMENT_5',
    'FLAG_DOCUMENT_6',
    'FLAG_DOCUMENT_7',
    'FLAG_DOCUMENT_8',
    'FLAG_DOCUMENT_9',
    'FLAG_DOCUMENT_10',
    'FLAG_DOCUMENT_11',
    'FLAG_DOCUMENT_12',
    'FLAG_DOCUMENT_13',
    'FLAG_DOCUMENT_14',
    'FLAG_DOCUMENT_15',
    'FLAG_DOCUMENT_16',
    'FLAG_DOCUMENT_17',
    'FLAG_DOCUMENT_18',
    'FLAG_DOCUMENT_19',
    'FLAG_DOCUMENT_20',
    'FLAG_DOCUMENT_21']
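The flag-sum features decided above reduce to a row-wise sum of the binary columns, as featsum() does for plotting; a minimal sketch on a toy frame standing in for appl_train:

```python
import pandas as pd

# Toy binary contact flags (1 = provided, 0 = not provided).
flag_cols = ['FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_EMAIL']
toy = pd.DataFrame({
    'FLAG_MOBIL':     [1, 1, 1],
    'FLAG_EMP_PHONE': [1, 0, 1],
    'FLAG_EMAIL':     [0, 0, 1],
})

# One ordinal count per individual; the same pattern yields FlagDocu
# from the FLAG_DOCUMENT_* columns.
toy['FlagContact'] = toy[flag_cols].sum(axis=1)
print(toy['FlagContact'].tolist())  # [2, 1, 3]
```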
In [ ]:
#Function for creating categorical plots
def plotcat(x,r):
    sns.catplot(data=appl_train,x=x,hue='TARGET',kind='count');
    plt.xticks(rotation=r);
In [ ]:
plotcat('REGION_RATING_CLIENT',0)
No description has been provided for this image
In [ ]:
plotcat('REGION_RATING_CLIENT_W_CITY',0)
No description has been provided for this image
In [ ]:
print('Correlation:',round(appl_train['REGION_RATING_CLIENT_W_CITY'].corr(appl_train['REGION_RATING_CLIENT']),2))
Correlation: 0.95
In [ ]:
temp=[
    'REG_REGION_NOT_LIVE_REGION',
    'LIVE_REGION_NOT_WORK_REGION',
    'REG_CITY_NOT_LIVE_CITY',
    'LIVE_CITY_NOT_WORK_CITY'
]
In [ ]:
featsum(temp,'FlagAddr')
No description has been provided for this image

Observations:

  • REGION_RATING_CLIENT and REGION_RATING_CLIENT_W_CITY are highly correlated (0.95). These are Home Credit's ratings of the region where the client lives (values = 1, 2, 3). The majority of the sample falls in rating '2', where the default count is also most visible.
  • The 4 flag fields indicate whether one of the customer's addresses (registered, living, or working) differs from another given address, with values 0='SAME' and 1='DIFFERENT'. Most customers' declared addresses match (scoring 0), and since '0' dominates, it naturally contains most of the individuals who go on to default as well.

Decisions:

  • Use REGION_RATING_CLIENT as a model feature.
  • Drop REGION_RATING_CLIENT_W_CITY since it is highly correlated with REGION_RATING_CLIENT.
  • Create new ordinal feature 'FlagAddr' as the sum of all 4 flag-address fields. The higher the score, the more inaccurate address declarations were made by the individual.
In [ ]:
# COLUMNS TO BE DROPPED
list_col_drop.extend(['REGION_RATING_CLIENT_W_CITY'])

# CREATE NEW COLUMN : FLAG_ADDR
list_col_new_flagAddr = ['REG_REGION_NOT_LIVE_REGION','LIVE_REGION_NOT_WORK_REGION','REG_CITY_NOT_LIVE_CITY','LIVE_CITY_NOT_WORK_CITY']
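The list above only names the four flag-address columns; as a quick sketch (on hypothetical rows), the ordinal 'FlagAddr' score itself can be materialized with a row-wise sum, mirroring what featsum does for the plot:

```python
import pandas as pd

# Hypothetical rows with the four address-mismatch flags (0='SAME', 1='DIFFERENT')
sample = pd.DataFrame({
    'REG_REGION_NOT_LIVE_REGION': [0, 1],
    'LIVE_REGION_NOT_WORK_REGION': [0, 1],
    'REG_CITY_NOT_LIVE_CITY': [0, 0],
    'LIVE_CITY_NOT_WORK_CITY': [1, 1],
})
list_col_new_flagAddr = ['REG_REGION_NOT_LIVE_REGION', 'LIVE_REGION_NOT_WORK_REGION',
                         'REG_CITY_NOT_LIVE_CITY', 'LIVE_CITY_NOT_WORK_CITY']

# Row-wise sum: higher scores mean more mismatched address declarations
sample['FlagAddr'] = sample[list_col_new_flagAddr].sum(axis=1)
print(sample['FlagAddr'].tolist())  # → [1, 3]
```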
In [ ]:
appl_train[['OBS_30_CNT_SOCIAL_CIRCLE','DEF_30_CNT_SOCIAL_CIRCLE','OBS_60_CNT_SOCIAL_CIRCLE','DEF_60_CNT_SOCIAL_CIRCLE','DAYS_LAST_PHONE_CHANGE']].describe()
Out[ ]:
OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE
count 306490.000000 307511.000000 306490.000000 306490.000000 307510.000000
mean 1.422245 0.142944 1.405292 0.100049 -962.858788
std 2.400989 0.446033 2.379803 0.362291 826.808487
min 0.000000 0.000000 0.000000 0.000000 -4292.000000
25% 0.000000 0.000000 0.000000 0.000000 -1570.000000
50% 0.000000 0.000000 0.000000 0.000000 -757.000000
75% 2.000000 0.000000 2.000000 0.000000 -274.000000
max 348.000000 34.000000 344.000000 24.000000 0.000000
In [ ]:
#Sums items per row across the two given column groups of df and plots the resulting totals against each other as a split violin plot.
def featsumviolin(df,cols1,cols2,newcol1,newcol2):

    sample_count1=df[cols1].sum(axis=1)
    sample_count2=df[cols2].sum(axis=1)
    sample = df.copy()  # copy the (possibly filtered) input frame, not appl_train
    sample[newcol1]=sample_count1
    sample[newcol2]=sample_count2
    fig,ax = plt.subplots(figsize=(10,5))
    sns.violinplot(data=sample,hue='TARGET',x=newcol1,y=newcol2,split=True,inner='quart',linewidth=1.3,palette={1:"#FF9999", 0:"white"});
In [ ]:
featsumviolin(appl_train[(appl_train.OBS_30_CNT_SOCIAL_CIRCLE<348)&(appl_train.OBS_60_CNT_SOCIAL_CIRCLE<344)],
              ['DEF_30_CNT_SOCIAL_CIRCLE','DEF_60_CNT_SOCIAL_CIRCLE'],
              ['OBS_30_CNT_SOCIAL_CIRCLE','OBS_60_CNT_SOCIAL_CIRCLE'],
              'DEF_NEW','OBS_NEW')
No description has been provided for this image
In [ ]:
sns.set_theme(palette="muted",color_codes=True)
f,axes = plt.subplots(2,2,figsize=(10,5),sharex=True)

# distplot is deprecated; histplot/kdeplot are its current replacements
sns.histplot(appl_train.loc[appl_train.TARGET==0,'DAYS_LAST_PHONE_CHANGE'],color="b",ax=axes[0,0]).set_title('Target == 0 (Paid)')
sns.kdeplot(appl_train.loc[appl_train.TARGET==0,'DAYS_LAST_PHONE_CHANGE'],color="g",fill=True,ax=axes[0,1]).set_title('Target == 0 (Paid)')

sns.histplot(appl_train.loc[appl_train.TARGET==1,'DAYS_LAST_PHONE_CHANGE'],color="r",ax=axes[1,0]).set_title('Target == 1 (Unpaid)')
sns.kdeplot(appl_train.loc[appl_train.TARGET==1,'DAYS_LAST_PHONE_CHANGE'],color="m",fill=True,ax=axes[1,1]).set_title('Target == 1 (Unpaid)')

plt.tight_layout()
No description has been provided for this image

Observations:

  • There are 2 variables related to 30 days past due (30 DPD) and another 2 for 60 days past due (60 DPD). I combined each pair and plotted them as a split violin plot above. Those who defaulted and those who did not show very similar distributions of past-due observations in their social circles, so these fields separate the two groups poorly. Surprisingly, even most of those surrounded by many late observations still got their loans cleared off.

Note: A payment status of 30-days late means that payment is between 30-59 days past the payment due date (30 DPD). A payment status of 60-days late means that payment is between 60-89 days past the payment due date (60 DPD).

  • I plotted separately, for those with good and bad loan records, the number of days before the application that a customer last changed their phone. The result is much as expected: most people change phones within a span of only 2-3 years, so the mass of the distribution sits close to the application date, and the counts taper off as phone tenure stretches beyond the average.

Decisions:

  • Drop the DPD variables.
  • Drop DAYS_LAST_PHONE_CHANGE.
In [ ]:
# COLUMNS TO BE DROPPED
list_col_drop.extend(['OBS_30_CNT_SOCIAL_CIRCLE','DEF_30_CNT_SOCIAL_CIRCLE','OBS_60_CNT_SOCIAL_CIRCLE','DEF_60_CNT_SOCIAL_CIRCLE','DAYS_LAST_PHONE_CHANGE'])
In [ ]:
appl_train[['AMT_REQ_CREDIT_BUREAU_HOUR','AMT_REQ_CREDIT_BUREAU_DAY','AMT_REQ_CREDIT_BUREAU_WEEK','AMT_REQ_CREDIT_BUREAU_MON','AMT_REQ_CREDIT_BUREAU_QRT','AMT_REQ_CREDIT_BUREAU_YEAR']].describe()
Out[ ]:
AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 265992.000000 265992.000000 265992.000000 265992.000000 265992.000000 265992.000000
mean 0.006402 0.007000 0.034362 0.267395 0.265474 1.899974
std 0.083849 0.110757 0.204685 0.916002 0.794056 1.869295
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000
75% 0.000000 0.000000 0.000000 0.000000 0.000000 3.000000
max 4.000000 9.000000 8.000000 27.000000 261.000000 25.000000
In [ ]:
for i in ['AMT_REQ_CREDIT_BUREAU_HOUR','AMT_REQ_CREDIT_BUREAU_DAY','AMT_REQ_CREDIT_BUREAU_WEEK']:
    for j in ['AMT_REQ_CREDIT_BUREAU_MON','AMT_REQ_CREDIT_BUREAU_QRT','AMT_REQ_CREDIT_BUREAU_YEAR']:
        fig,ax = plt.subplots(figsize=(10,3))
        sns.stripplot(data=appl_train[appl_train.AMT_REQ_CREDIT_BUREAU_QRT<261],x=i,y=j,hue='TARGET',alpha=0.5,jitter=0.3,dodge=True,ax=ax)
        plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Observations:

  • The majority of the sample have no credit bureau enquiries (0.0), regardless of the time window and regardless of their repayment status. The most common enquiry counts range from 0 to 2, and the pattern looks the same for TARGET==0 and TARGET==1. Keep in mind that only a small share of the sample (about 8%) defaulted (TARGET==1).

Decisions:

  • Drop the credit bureau enquiry fields.
In [ ]:
# COLUMNS TO BE DROPPED
list_col_drop.extend(['AMT_REQ_CREDIT_BUREAU_HOUR','AMT_REQ_CREDIT_BUREAU_DAY','AMT_REQ_CREDIT_BUREAU_WEEK','AMT_REQ_CREDIT_BUREAU_MON','AMT_REQ_CREDIT_BUREAU_QRT','AMT_REQ_CREDIT_BUREAU_YEAR'])
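The notebook accumulates list_col_drop across sections; when the drops are eventually applied, a single DataFrame.drop call suffices. A minimal sketch on a hypothetical miniature frame standing in for appl_train:

```python
import pandas as pd

# Hypothetical miniature frame; list_col_drop stands in for the list
# accumulated throughout the notebook.
df = pd.DataFrame({
    'SK_ID_CURR': [1, 2],
    'AMT_REQ_CREDIT_BUREAU_HOUR': [0.0, 1.0],
    'DAYS_LAST_PHONE_CHANGE': [-100, -200],
})
list_col_drop = ['AMT_REQ_CREDIT_BUREAU_HOUR', 'DAYS_LAST_PHONE_CHANGE']

# errors='ignore' keeps the call safe if a listed column was already removed
df = df.drop(columns=list_col_drop, errors='ignore')
print(df.columns.tolist())  # → ['SK_ID_CURR']
```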

6. Data Wrangling¶

After (finally) checking all our fields, it is time to proceed with data wrangling, also known as data cleaning.

Complete the null values of the following features:

  • 'EXT_SOURCE_1'
  • 'EXT_SOURCE_2'
  • 'EXT_SOURCE_3'
In [ ]:
#Essential Functions
def getmean(df,ls_cols):
    list_mean = []
    for i in ls_cols:
        mean_val = df[i].mean()
        list_mean.append(mean_val)
    return list_mean

def fill_ave_ext(df,ls_cols):
    list_mean = getmean(df,ls_cols) # mean of EXT_SOURCE_*
    ctr=0
    for i in ls_cols:
        df[i] = df[i].fillna(list_mean[ctr])
        ctr+=1
    return df

# Fill in the training set
fill_ave_ext(appl_train,list_col_fill_ext);

# Fill in the testing set
fill_ave_ext(appl_test,list_col_fill_ext);
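The cells above fill each set with its own column means. An alternative sketch (on hypothetical toy frames) computes the means on the training set only and reuses them for the test set, so the imputation statistic never depends on test data:

```python
import pandas as pd

# Hypothetical stand-ins for appl_train / appl_test
train = pd.DataFrame({'EXT_SOURCE_1': [0.2, 0.4, None]})
test = pd.DataFrame({'EXT_SOURCE_1': [None, 0.5]})

means = train.mean(numeric_only=True)  # statistics learned from train only
train = train.fillna(means)
test = test.fillna(means)              # same train-derived means applied to test
# test's null is now filled with the train mean (0.3)
```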
In [ ]:
# NO MORE NULL VALUES FOR 'EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3'
print(appl_train[list_col_fill_ext].info())
print(appl_test[list_col_fill_ext].info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 3 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   EXT_SOURCE_1  307511 non-null  float64
 1   EXT_SOURCE_2  307511 non-null  float64
 2   EXT_SOURCE_3  307511 non-null  float64
dtypes: float64(3)
memory usage: 7.0 MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   EXT_SOURCE_1  48744 non-null  float64
 1   EXT_SOURCE_2  48744 non-null  float64
 2   EXT_SOURCE_3  48744 non-null  float64
dtypes: float64(3)
memory usage: 1.1 MB
None

Complete the null values for 'CNT_FAM_MEMBERS'.

In [ ]:
#Complete the null values for 'CNT_FAM_MEMBERS'.
def fill_0_fam(df,ls_cols):
    df[ls_cols] = df[ls_cols].fillna(0)
    return df


fill_0_fam(appl_train,'CNT_FAM_MEMBERS');
fill_0_fam(appl_test,'CNT_FAM_MEMBERS');
In [ ]:
# NO MORE NULL VALUES FOR 'CNT_FAM_MEMBERS'
print(appl_train[list_col_fill_fam].info())
print(appl_test[list_col_fill_fam].info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 1 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   CNT_FAM_MEMBERS  307511 non-null  float64
dtypes: float64(1)
memory usage: 2.3 MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Data columns (total 1 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CNT_FAM_MEMBERS  48744 non-null  float64
dtypes: float64(1)
memory usage: 380.9 KB
None

Convert the anomalous sentinel value (365243) in 'DAYS_EMPLOYED' to -29200 (about 80 years), keeping the column on its negative-days scale.

In [ ]:
# ANOMALY DATA COUNT BEFORE CONVERSION
print('Train set :',len(appl_train[appl_train.DAYS_EMPLOYED==365243]))
print('Test set  :',len(appl_test[appl_test.DAYS_EMPLOYED==365243]))
Train set : 55374
Test set  : 9274
In [ ]:
def conv_daysemp(df,ls_cols):
    # Assign back instead of a chained inplace replace (deprecated in pandas)
    df[ls_cols[0]] = df[ls_cols[0]].replace(to_replace=365243,value=-29200)
    return df
In [ ]:
conv_daysemp(appl_train,list_col_conv_daysemp);
conv_daysemp(appl_test,list_col_conv_daysemp);
In [ ]:
# ANOMALY DATA COUNT AFTER CONVERSION
print('Train set :',len(appl_train[appl_train.DAYS_EMPLOYED==365243]))
print('Test set  :',len(appl_test[appl_test.DAYS_EMPLOYED==365243]))
Train set : 0
Test set  : 0

Convert the categorical text columns to numerical ones for:

  • CODE_GENDER
  • NAME_EDUCATION_TYPE
  • ORGANIZATION_TYPE
In [ ]:
# BEFORE CONVERSION
print(appl_train[['CODE_GENDER','NAME_EDUCATION_TYPE','ORGANIZATION_TYPE']].head(3))
print(appl_test[['CODE_GENDER','NAME_EDUCATION_TYPE','ORGANIZATION_TYPE']].head(3))
  CODE_GENDER            NAME_EDUCATION_TYPE       ORGANIZATION_TYPE
0           M  Secondary / secondary special  Business Entity Type 3
1           F               Higher education                  School
2           M  Secondary / secondary special              Government
  CODE_GENDER            NAME_EDUCATION_TYPE  ORGANIZATION_TYPE
0           F               Higher education       Kindergarten
1           M  Secondary / secondary special      Self-employed
2           M               Higher education  Transport: type 3
In [ ]:
def conv_gender(df,ls_cols):
    df[ls_cols[0]] = df[ls_cols[0]].map({'XNA':0,'M':1,'F':2}).astype(int)
    return df

def conv_education(df,ls_cols):
    temp_dict = {
        'Lower secondary':1,
        'Secondary / secondary special':2,
        'Incomplete higher':3,
        'Higher education':4,
        'Academic degree':5
    }
    df[ls_cols[0]] = df[ls_cols[0]].map(temp_dict).astype(int)
    return df

def conv_org(df,ls_cols):
    ls_ctr=[i for i in np.arange(1,len(ls_cols)+1)]
    temp_dict = dict(zip(ls_cols,ls_ctr))
    df['ORGANIZATION_TYPE'] = df['ORGANIZATION_TYPE'].map(temp_dict).astype(int)
    return df
In [ ]:
conv_gender(appl_train,list_col_conv_gender);
conv_gender(appl_test,list_col_conv_gender);

conv_education(appl_train,list_col_conv_edu);
conv_education(appl_test,list_col_conv_edu);
In [ ]:
orgtype = sorted(appl_train.ORGANIZATION_TYPE.unique())
conv_org(appl_train,orgtype);
conv_org(appl_test,orgtype);
In [ ]:
# AFTER CONVERSION
print(appl_train[['CODE_GENDER','NAME_EDUCATION_TYPE','ORGANIZATION_TYPE']].head(3))
print(appl_test[['CODE_GENDER','NAME_EDUCATION_TYPE','ORGANIZATION_TYPE']].head(3))
   CODE_GENDER  NAME_EDUCATION_TYPE  ORGANIZATION_TYPE
0            1                    2                  6
1            2                    4                 40
2            1                    2                 12
   CODE_GENDER  NAME_EDUCATION_TYPE  ORGANIZATION_TYPE
0            2                    4                 29
1            1                    2                 43
2            1                    4                 55
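The conversions above stay consistent across the two sets because each mapping dictionary is built once (from the sorted training categories) and then applied unchanged to both train and test. A minimal sketch with hypothetical categories:

```python
import pandas as pd

# Hypothetical stand-ins for the ORGANIZATION_TYPE column in train/test
train_org = pd.Series(['School', 'Government', 'School'])
test_org = pd.Series(['Government'])

# Build the mapping once from the sorted training categories
cats = sorted(train_org.unique())                      # ['Government', 'School']
mapping = {c: i for i, c in enumerate(cats, start=1)}  # {'Government': 1, 'School': 2}

# Apply the same dict to both sets so the codes agree
print(train_org.map(mapping).tolist())  # → [2, 1, 2]
print(test_org.map(mapping).tolist())   # → [1]
```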

7. Hypothesis Testing¶

7.1 Debt-to-Income Ratio and Income per Child¶

Hypothesis: The Debt-to-Income Ratio varies with Income per Child.¶

  1. Null Hypothesis (H0):

    There is no significant difference in the Debt-to-Income Ratio based on the Income per Child.

  2. Alternative Hypothesis (H1):

The Debt-to-Income Ratio significantly varies based on the Income per Child.

  3. Test:

    Pearson Correlation Coefficient.

  4. Reason for choosing this Test:

    The Pearson correlation coefficient is appropriate for assessing the strength and direction of the linear relationship between two continuous variables, in this case, the Debt-to-Income Ratio and Income per Child.

In [ ]:
# Create new features
appl_train['debt_income_ratio'] = appl_train['AMT_CREDIT'] / appl_train['AMT_INCOME_TOTAL']
appl_train['income_per_child'] = appl_train['AMT_INCOME_TOTAL'] / (1 + appl_train['CNT_CHILDREN'])

# Step 1: Define significance level
alpha = 0.05

# Step 2: Calculate the Pearson correlation and p-value
correlation, p_value = stats.pearsonr(appl_train['debt_income_ratio'], appl_train['income_per_child'])

# Step 3: Print the results
print(f"Pearson correlation coefficient: {correlation}")
print(f"p-value: {p_value}")

# Step 4: Hypothesis Testing
if p_value <= alpha:
    print("Reject the null hypothesis: Significant correlation exists.")
else:
    print("Fail to reject the null hypothesis: No significant correlation exists.")
Pearson correlation coefficient: -0.13704002646062904
p-value: 0.0
Reject the null hypothesis: Significant correlation exists.

Debt-to-Income Ratio by Default Status¶

Visualize the distribution of the debt-to-income ratio across default and non-default users.

In [ ]:
plt.figure(figsize=(8,4))
sns.histplot(x='debt_income_ratio', hue='TARGET', data=appl_train, bins=50, kde=True, palette=['blue', 'white'])
plt.xlim(0, 20)
plt.title('Debt to Income Ratio by Default Status')
plt.xlabel('Debt to Income Ratio')
Out[ ]:
Text(0.5, 0, 'Debt to Income Ratio')
No description has been provided for this image
In [ ]:
plt.figure(figsize=(8, 6))
sns.scatterplot(x='debt_income_ratio', y='income_per_child', data=appl_train)
plt.title('Scatter Plot of Debt-to-Income Ratio vs. Income per Child')
plt.xlabel('Debt-to-Income Ratio')
plt.ylabel('Income per Child')
plt.axhline(0, color='red', linestyle='--')  # Optional: Add a horizontal line for reference
plt.axvline(0, color='red', linestyle='--')  # Optional: Add a vertical line for reference
Out[ ]:
<matplotlib.lines.Line2D at 0x7cfbb0933490>
No description has been provided for this image

Hypothesis Conclusion:¶

Based on the Pearson correlation test, we obtained a coefficient of approximately -0.137 between the debt-to-income ratio and income per child, indicating a weak negative linear relationship. The reported p-value of 0.0 (i.e., smaller than floating-point precision can represent) is far below our significance level of 0.05, so we reject the null hypothesis and conclude that the correlation is statistically significant: applicants with higher debt-to-income ratios tend to have slightly lower income per child. Note that with over 300,000 rows, even a weak correlation is detected as highly significant.
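Since both engineered ratios are heavily right-skewed, a rank-based Spearman correlation is a natural robustness check alongside Pearson. A sketch on synthetic skewed data (the variables below are illustrative stand-ins, not the actual notebook columns):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.lognormal(sigma=1.0, size=5000)             # skewed stand-in for one ratio
y = -0.1 * x + rng.lognormal(sigma=1.0, size=5000)  # weakly negatively related stand-in

r_pearson, p_pearson = stats.pearsonr(x, y)
r_spearman, p_spearman = stats.spearmanr(x, y)
print(round(r_pearson, 3), round(r_spearman, 3))  # both should come out negative
```

When the two coefficients roughly agree, the linear Pearson result is not an artifact of a few extreme values.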

7.2 Income Level and Default Likelihood¶

Hypothesis: Users with lower income levels are more likely to default.¶

  1. Null Hypothesis (H0):

    There is no significant difference in income levels between users who default and those who do not.

  2. Alternative Hypothesis (H1):

There is a significant difference in income levels between users who default and those who do not.

  3. Test:

    Independent Samples T-Test.

  4. Reason for choosing this Test:

    The Independent Samples T-Test is appropriate for comparing the means of two independent groups (users who default and those who do not) to determine if income levels significantly differ between the two groups.

In [ ]:
# Split income data based on target variable
defaulted_income = appl_train[appl_train['TARGET'] == 1]['AMT_INCOME_TOTAL']
non_defaulted_income = appl_train[appl_train['TARGET'] == 0]['AMT_INCOME_TOTAL']

# Perform T-test
t_stat, p_val = ttest_ind(defaulted_income, non_defaulted_income, equal_var=False)
print("T-Test for Income Levels and Default Likelihood")
print(f"T-Statistic: {t_stat:.4f}, P-Value: {p_val:.4f}")
T-Test for Income Levels and Default Likelihood
T-Statistic: -0.7307, P-Value: 0.4650
In [ ]:
plt.figure(figsize=(12, 6))

# Box Plot (hue mirrors x to satisfy the current seaborn palette API)
plt.subplot(1, 2, 1)
sns.boxplot(x='TARGET', y='AMT_INCOME_TOTAL', hue='TARGET', data=appl_train, palette='pastel', legend=False)
plt.title('Box Plot of Income Levels by Default Status')
plt.xlabel('Default Status (0 = No Default, 1 = Default)')
plt.ylabel('Income Level (Total Income)')

# Violin Plot
plt.subplot(1, 2, 2)
sns.violinplot(x='TARGET', y='AMT_INCOME_TOTAL', hue='TARGET', data=appl_train, palette='pastel', legend=False)
plt.title('Violin Plot of Income Levels by Default Status')
plt.xlabel('Default Status (0 = No Default, 1 = Default)')
plt.ylabel('Income Level (Total Income)')
plt.tight_layout()
No description has been provided for this image

Hypothesis Conclusion:¶

The Independent Samples T-Test revealed a T-Statistic of -0.7307 and a P-Value of 0.4650. Since the P-Value is well above the conventional significance level of 0.05, we fail to reject the null hypothesis (H0).

This indicates that there is no significant difference in income levels between users who default and those who do not. Therefore, we conclude that income level does not appear to be a significant factor influencing loan default likelihood in this analysis.

7.3 Education Level and Income¶

Hypothesis: Users with higher education levels have higher average income.¶

  1. Null Hypothesis (H0):

    There is no significant difference in income based on education level.

  2. Alternative Hypothesis (H1):

Income significantly varies by education level.

  3. Test:

    ANOVA (Analysis of Variance).

  4. Reason for choosing this Test:

    ANOVA (Analysis of Variance) is chosen to assess whether there are significant differences in average income across multiple education levels, as it effectively compares means among three or more independent groups.

In [ ]:
education_groups = [group['AMT_INCOME_TOTAL'].values for name, group in appl_train.groupby('NAME_EDUCATION_TYPE')]

# Perform ANOVA
f_stat, p_val = f_oneway(*education_groups)

# Print ANOVA results
print("ANOVA for Income by Education Level")
print(f"F-Statistic: {f_stat:.4f}, P-Value: {p_val:.4f}")

# Visualization: Box Plot for Income by Education Level
plt.figure(figsize=(12, 6))
sns.boxplot(x='NAME_EDUCATION_TYPE', y='AMT_INCOME_TOTAL', hue='NAME_EDUCATION_TYPE', data=appl_train, palette='pastel', legend=False)
plt.title('Box Plot of Income by Education Level')
plt.xlabel('Education Level')
plt.ylabel('Income Level (Total Income)')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.tight_layout()
ANOVA for Income by Education Level
F-Statistic: 753.2160, P-Value: 0.0000
No description has been provided for this image

Hypothesis Conclusion:¶

The ANOVA analysis for income by education level yielded an F-Statistic of 753.2160 and a P-Value of 0.0000. Since the P-Value is well below the conventional significance level of 0.05, we reject the null hypothesis (H0).

This indicates that there is a significant difference in income based on education level. ANOVA alone only tells us that mean income differs somewhere across the levels; the upward trend visible in the box plot across education levels is what supports the direction of the hypothesis, namely that higher education levels are associated with higher average incomes.

7.4 Employment Length and Default Likelihood¶

Hypothesis: Users with shorter employment durations are more likely to default.¶

  1. Null Hypothesis (H0):

    There is no significant difference in employment duration between users who default and those who do not.

  2. Alternative Hypothesis (H1):

There is a significant difference in employment duration between users who default and those who do not.

  3. Test:

    Independent Samples T-Test.

  4. Reason for choosing this Test:

    The Independent Samples T-Test is suitable for comparing the means of two independent groups (users who default and those who do not) to determine if there is a significant difference in employment length between the two groups.

In [ ]:
defaulted_emp = appl_train[appl_train['TARGET'] == 1]['DAYS_EMPLOYED'].abs()
non_defaulted_emp = appl_train[appl_train['TARGET'] == 0]['DAYS_EMPLOYED'].abs()

# Perform T-test
t_stat, p_val_emp = ttest_ind(defaulted_emp, non_defaulted_emp, equal_var=False)
print("T-Test for Employment Length and Default Likelihood")
print(f"T-Statistic: {t_stat:.4f}, P-Value: {p_val_emp:.4f}")

# Visualization: Box Plot for Employment Length by Default Status
plt.figure(figsize=(12, 6))
sns.boxplot(x='TARGET', y='DAYS_EMPLOYED', hue='TARGET', data=appl_train, palette='pastel', legend=False)
plt.title('Box Plot of Employment Length by Default Status')
plt.xlabel('Default Status (0 = No Default, 1 = Default)')
plt.ylabel('Employment Length (Days)')
plt.xticks(ticks=[0, 1], labels=['Non-Defaulted', 'Defaulted'])
plt.tight_layout()
plt.show()

# Interpretation
if p_val_emp < 0.05:
    print("Result: Reject the null hypothesis (significant difference in employment length between groups).")
else:
    print("Result: Fail to reject the null hypothesis (no significant difference in employment length).")
T-Test for Employment Length and Default Likelihood
T-Statistic: -37.4471, P-Value: 0.0000
No description has been provided for this image
Result: Reject the null hypothesis (significant difference in employment length between groups).

Hypothesis Conclusion:¶

The T-Test for Employment Length and Default Likelihood yielded a T-Statistic of -37.4471 and a P-Value of 0.0000. Since the P-Value is well below the conventional significance level of 0.05, we reject the null hypothesis (H0).

This indicates that there is a significant difference in employment length between users who default and those who do not. Therefore, we conclude that users with shorter employment durations are indeed more likely to default.

7.5 Number of Children and Default Likelihood¶

Hypothesis: Users with more children are more likely to default.¶

  1. Null Hypothesis (H0):

    There is no association between the number of children and default likelihood.

  2. Alternative Hypothesis (H1):

There is an association between the number of children and default likelihood.

  3. Test:

    Chi-Square Test.

  4. Reason for choosing this Test:

    The Chi-Square Test for Independence is appropriate for assessing the relationship between two categorical variables (in this case, the number of children and default status) to determine if the distribution of defaults varies based on the number of children users have.

In [ ]:
children_contingency = pd.crosstab(appl_train['CNT_CHILDREN'], appl_train['TARGET'])

# Perform Chi-Square test
chi2, p_val_children, dof, expected = chi2_contingency(children_contingency)
print("Chi-Square Test for Number of Children and Default Likelihood")
print(f"Chi2 Statistic: {chi2:.4f}, P-Value: {p_val_children:.4f}")

# Visualization: Bar Plot for Number of Children by Default Status
# (pass figsize to the pandas plot call; a separate plt.figure() would be left empty)
children_contingency.plot(kind='bar', stacked=True, color=['skyblue', 'salmon'], figsize=(12, 6))
plt.xlim(0,6)
plt.title('Default Status by Number of Children')
plt.xlabel('Number of Children')
plt.ylabel('Count')
plt.xticks(rotation=0)  # Keep x-axis labels horizontal
plt.legend(title='Default Status', labels=['Non-Defaulted (0)', 'Defaulted (1)'])
plt.tight_layout()

# Interpretation
if p_val_children < 0.05:
    print("Result: Reject the null hypothesis (significant association between number of children and default likelihood).")
else:
    print("Result: Fail to reject the null hypothesis (no significant association).")
Chi-Square Test for Number of Children and Default Likelihood
Chi2 Statistic: 185.4518, P-Value: 0.0000
Result: Reject the null hypothesis (significant association between number of children and default likelihood).
No description has been provided for this image

Hypothesis Conclusion:¶

The Chi-Square Test for Number of Children and Default Likelihood yielded a Chi2 Statistic of 185.4518 and a P-Value of 0.0000. Since the P-Value is well below the conventional significance level of 0.05, we reject the null hypothesis (H0).

This indicates that there is a significant association between the number of children users have and their likelihood of defaulting on loans. Therefore, we conclude that users with more children are indeed more likely to default. This finding highlights the potential impact of family size on financial stability and default risk.

7.6 Annuity Amount and Loan Default¶

Hypothesis: Higher annuity amounts are associated with a higher likelihood of default.¶

  1. Null Hypothesis (H0):

    There is no significant difference in annuity amount between users who default and those who don’t.

  2. Alternative Hypothesis (H1):

There is a significant difference in annuity amount between users who default and those who don’t.

  3. Test:

    Independent Samples T-Test on AMT_ANNUITY.

  4. Reason for choosing this Test:

    The Independent Samples T-Test is suitable for comparing the means of two independent groups (users who default and those who do not) to determine if there is a significant difference in annuity amount between the two groups.

In [ ]:
# Separate annuity data for defaulted and non-defaulted users
defaulted_annuity = appl_train[appl_train['TARGET'] == 1]['AMT_ANNUITY']
non_defaulted_annuity = appl_train[appl_train['TARGET'] == 0]['AMT_ANNUITY']

# Perform T-test
t_stat, p_val_annuity = ttest_ind(defaulted_annuity.dropna(), non_defaulted_annuity.dropna(), equal_var=False)
print("T-Test for Annuity Amount and Default Likelihood")
print(f"T-Statistic: {t_stat:.4f}, P-Value: {p_val_annuity:.4f}")

plt.figure(figsize=(8, 5))
sns.histplot(data=appl_train, x='AMT_ANNUITY', hue='TARGET', bins=50, kde=True, palette='coolwarm')
plt.title("Annuity Amount Distribution by Default Status")
plt.xlim(0,100000)
plt.xlabel("Annuity Amount")
plt.ylabel("Frequency")
plt.legend(title='Default Status', labels=['Non-Defaulted (0)', 'Defaulted (1)'])
plt.tight_layout()

# Interpretation
if p_val_annuity < 0.05:
    print("Result: Reject the null hypothesis (significant difference in annuity amount between groups).")
else:
    print("Result: Fail to reject the null hypothesis (no significant difference in annuity amount).")
T-Test for Annuity Amount and Default Likelihood
T-Statistic: -8.1473, P-Value: 0.0000
Result: Reject the null hypothesis (significant difference in annuity amount between groups).
No description has been provided for this image

Hypothesis Conclusion:¶

The T-Test for Annuity Amount and Default Likelihood resulted in a T-Statistic of -8.1473 and a P-Value of 0.0000. Since the P-Value is significantly less than the standard significance level of 0.05, we reject the null hypothesis (H0).

This indicates a significant difference in annuity amounts between users who default and those who do not. Therefore, we conclude that higher annuity amounts are indeed associated with a higher likelihood of default, suggesting that annuity amounts are an important factor in evaluating the risk of default among users.

Feature Creation and Feature Engineering¶

Feature engineering is a critical step in machine learning and data analysis where raw data is transformed into meaningful features. This process improves model performance by capturing the underlying patterns in the data, providing the model with structured, relevant inputs for accurate predictions.


How Feature Engineering Helped¶

Feature engineering transformed a complex, fragmented dataset into a structured, unified format that directly addressed our analytical goals:

  1. Captured Critical Metrics: Aggregated data revealed total amounts financed, repayment behaviors, and historical loan performance.
  2. Generated Time-Based Insights: Vintage features highlighted loan durations and credit history trends.
  3. Identified Risk Profiles: Delinquency features provided a clear picture of a client’s payment history, offering insights into their likelihood of default.
  4. Simplified Modeling: Consolidated data ensured models could train on a clean, enriched dataset with meaningful inputs.

Screenshot 2024-12-12 143818.png

Implementation of Feature Engineering¶

We created three primary categories of features: Amount Financed, Vintage (Time Duration), and Delinquency. These were aggregated from multiple tables using statistical functions such as max, min, mean, and sum. In total, we engineered 600+ custom aggregated features, enabling a comprehensive understanding of client behaviors.


1. Amount Financed¶

This category focuses on the actual amount of money clients borrowed across various loans. It includes around 120 features aggregated from different tables. Below are some examples:

  • POS_CASH_balance.csv: Aggregated monthly balances of POS (point of sale) and cash loans:

    • TOTAL_POS_AMT: Total of all POS cash balances.
    • AVG_POS_AMT: Average POS balance across months.
  • credit_card_balance.csv: Monthly balances of credit card loans:

    • MAX_CREDIT_CARD_AMT: Maximum credit card balance.
    • RATIO_CREDIT_USED: Ratio of credit used to credit limit, which helped quantify spending behavior.
  • installments_payments.csv: Repayment details:

    • SUM_INSTALLMENT_PAID: Total repayment amount, which indicated the total financial commitment of the client.
    • AVG_INSTALLMENT_PAID: Average installment payment, useful for gauging consistency in repayments.

👉 Explore the notebook for Amount Financed Features


2. Vintage (Time Duration)¶

Vintage features represent performance metrics of loans and accounts over time. They track clients’ financial stability and credit history trends. This category includes 200+ features. Examples include:

  • bureau.csv: Loan durations from external institutions:

    • MAX_LOAN_DURATION: Maximum duration of loans, highlighting long-term financial commitments.
    • AVG_LOAN_DURATION: Mean loan duration, useful for assessing average borrowing patterns.
  • previous_application.csv: Time gaps and durations of past Home Credit applications:

    • MIN_TIME_SINCE_LAST_APPLICATION: Shortest gap between previous loans, indicating rapid borrowing behavior.
    • AVG_LOAN_VINTAGE: Average duration of previous loans, which revealed financial maturity over time.

👉 Explore the notebook for Vintage Features


3. Delinquency¶

Delinquency features capture past repayment performance, highlighting successes and failures in honoring loans. This category includes 360+ features derived from tables such as:

  • bureau.csv: Delinquency patterns:

    • COUNT_LOANS_DELINQUENT: Number of delinquent loans, a strong indicator of risk.
    • MAX_DELINQUENCY_DAYS: Maximum days past due, revealing the severity of delinquencies.
  • bureau_balance.csv: Monthly bureau balance data:

    • SUM_DELINQUENT_MONTHS: Total months of delinquency, summarizing the extent of missed payments.
    • RATIO_DELINQUENT_MONTHS: Ratio of delinquent months to total credit history, normalizing delinquency across credit duration.

👉 Explore the notebook for Delinquency Features


Aggregation at SK_ID_CURR Level¶

To ensure that each client is represented by a single row, all features were aggregated at the SK_ID_CURR level. The aggregation process involved:

  • Statistical Functions: max, min, mean, and sum to summarize data.
  • Ratios: Relationships such as RATIO_AMT_CREDIT_TO_INCOME were derived for added insights.
  • Time-Based Calculations: Differences between key dates were used to generate durations like loan vintage and repayment delays.
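As a minimal sketch of this aggregation pattern, the pandas `groupby`/`agg` idiom produces one row per `SK_ID_CURR` with max/min/mean/sum features and derived ratios. The toy table and column names below are illustrative, not the real bureau schema:

```python
import pandas as pd

# Toy bureau-style table; column names are illustrative, not the real schema
bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 2, 2],
    "AMT_CREDIT_SUM": [1000.0, 2000.0, 500.0, 700.0, 300.0],
    "DAYS_CREDIT": [-900, -400, -1200, -300, -60],
})

# One row per client: max/min/mean/sum over each numeric column
agg = bureau.groupby("SK_ID_CURR").agg(["max", "min", "mean", "sum"])

# Flatten the MultiIndex columns into names like AMT_CREDIT_SUM_mean
agg.columns = ["_".join(col) for col in agg.columns]

# Example ratio feature derived at the client level
agg["RATIO_MAX_TO_SUM_CREDIT"] = agg["AMT_CREDIT_SUM_max"] / agg["AMT_CREDIT_SUM_sum"]
print(agg.head())
```

The same pattern, repeated per source table and joined back on `SK_ID_CURR`, yields the 600+ aggregated features described above.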

Top Feature Selection¶

Selecting the most relevant features is a crucial step in building efficient and interpretable machine learning models. By reducing the dimensionality of the dataset, we can focus on the features that have the highest impact on the predictive performance, improve computational efficiency, and reduce overfitting.

In this project, we employed three key techniques for feature selection to identify the top features from our baseline models.


1. Lasso Feature Selection¶

Lasso (Least Absolute Shrinkage and Selection Operator) regression was used to select features by applying L1 regularization. Lasso assigns weights to features and shrinks less important feature weights to zero, effectively performing feature selection.


2. Recursive Feature Elimination (RFE)¶

Recursive Feature Elimination iteratively trains the model and removes the least important features based on the model’s performance. By repeating this process, RFE selects the optimal set of features.


3. Random Forest Feature Importance¶

Random Forest, an ensemble learning method, was used to evaluate feature importance based on the decrease in impurity when features are split. This technique is robust to overfitting and can handle nonlinear relationships.

👉 Access the full feature selection notebook
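The three techniques can be sketched with scikit-learn on a synthetic stand-in for our engineered feature matrix; the data and hyperparameters here are illustrative, not the ones used in the project notebook:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel, RFE
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the engineered feature matrix
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=42)

# 1. Lasso-style selection: L1-penalised model shrinks weak feature weights to zero
l1 = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.1))
l1.fit(X, y)

# 2. RFE: repeatedly drop the least important feature until 10 remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)

# 3. Random Forest impurity-based importances
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
top_rf = np.argsort(rf.feature_importances_)[::-1][:10]

print("Lasso kept:", l1.get_support().sum(), "features")
print("RFE kept:  ", rfe.get_support().sum(), "features")
print("RF top-10: ", sorted(top_rf))
```

In practice we intersected and compared the feature sets produced by the three methods before settling on the final top features.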

Machine Learning Methodology and Architecture - Model Stacking¶

Based on the insights gained from exploratory data analysis and feature engineering, we adopted a robust and systematic methodology for prediction tasks, leveraging the power of model stacking and neural networks to handle the complexity and imbalance in the dataset.

Model Stacking in Machine Learning¶

What is Model Stacking?¶

Model stacking is an ensemble learning technique where predictions from multiple base models (called "level-0 models" or baseline models) are combined using another model, known as a meta-model (or "level-1 model"), to make the final predictions.


How it Works¶

  1. Train Base Models: Multiple models (e.g., decision trees, neural networks, or SVMs) are trained on the dataset.
  2. Generate Predictions: The base models make predictions on a validation set or unseen data.
  3. Train Meta-Model: The meta-model is trained using the predictions from the base models as input features, learning how to best combine these predictions.

Why Stacking is Used¶

  1. Leverage Model Diversity: Different models capture different patterns or features in the data.
  2. Reduce Overfitting: Combining predictions reduces the likelihood of overfitting compared to individual models.
  3. Improve Generalization: The meta-model learns from the weaknesses of the base models and makes better predictions on unseen data.

Benefits of Stacking¶

  • Better Generalization: The meta-model reduces bias and variance errors by learning how to combine base model outputs.
  • Flexibility: A wide variety of models (simple or complex) can be used in the stack.
  • Improved Accuracy: Stacking often outperforms individual models and simpler ensemble methods like bagging or boosting.

By allowing the ensemble to learn how to effectively weigh different base models, stacking helps create better generalized models.
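For illustration, the three-step workflow above can be reproduced in a few lines with scikit-learn's `StackingClassifier`. Note that this is a generic sketch: our actual pipeline uses custom ANN base models trained on separate feature groups, not the tree and linear models shown here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real feature matrix
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Level-0 base models; their out-of-fold predictions feed the meta-model
base = [("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("dt", DecisionTreeClassifier(max_depth=5, random_state=0))]

# Level-1 meta-model learns how to combine the base model predictions
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(),
                           cv=5)
stack.fit(X_tr, y_tr)
print("Stacked accuracy:", stack.score(X_te, y_te))
```

The `cv=5` argument ensures the meta-model is trained on out-of-fold base predictions, which is what prevents the stack from simply memorising the base models' training-set fit.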

The figure below shows the stacking architecture¶

TRUST ML Architecture (1).png

Amount Financed Baseline Models (afm1 and afm2)¶

We have attached links to the individual baseline model code.

After attempting several tree-based and boosting models such as Random Forest, LightGBM, and XGBoost and getting subpar results, we decided to proceed with artificial neural networks for both models.

afm1 - AMTF_POS_CASH_MODEL¶

  • afm1 link - https://github.com/kautilyaa/TRUST/blob/main/Baseline_Models/SM_pos_cash_Model.ipynb

This model was built on the amount financed features derived from POS_CASH_balance.csv and installments_payments.csv, merged with the target values from application_train, aggregated at the SK_ID_CURR level.

8c4d1b40-8342-428a-a0cb-08c813421d7f.jpg

Classification Score¶

WhatsApp Image 2024-12-12 at 18.42.00.jpeg

afm2 - AMTF_CC_MODEL¶

  • afm2 link - https://github.com/kautilyaa/TRUST/blob/main/Baseline_Models/SM_CC_installments_model.ipynb

This model was built on the amount financed features derived from credit_card_balance.csv and installments_payments.csv, merged with the target values from application_train, aggregated at the SK_ID_CURR level.

8c4d1b40-8342-428a-a0cb-08c813421d7f (1).jpg

Classification Report¶

2d394b6f-58b7-4e4b-b8f7-876ad7a981ac.jpg

Delinquency Baseline Model (bm1)¶

We have attached links to the individual baseline model code.

We initially tried linear classifiers such as Logistic Regression and an SVM classifier without much success. We later realised that many of the columns were highly correlated, which compounded the problem of an imbalanced target variable. LightGBM gave decent results, but the final decision was to go with an ANN, as before. We tried several different combinations of hidden layers and settled on the architecture given below, achieving a test accuracy of about 73%, which we deemed acceptable for a weak learner.

bm1 - DEQ_BASELINE_MODEL¶

  • bm1 link - https://github.com/kautilyaa/TRUST/blob/main/Baseline_Models/SM_Deliquency_Model.ipynb

The model was built using features with 2 levels of aggregation utilizing the tables bureau.csv and bureau_balance.csv.

WhatsApp Image 2024-12-12 at 20.07.00.jpeg

Classification Report:¶

WhatsApp Image 2024-12-12 at 18.42.00 (1).jpeg

Vintage Baseline Models (vm1 and vm2)¶

We have attached links to the individual baseline model code.

After attempting several tree-based and boosting models such as Random Forest, LightGBM, and XGBoost and getting subpar results, we decided to proceed with artificial neural networks for both models.

vm1 - VIM_BUREAU_MODEL¶

  • vm1 link - {modellink}

The model was built using a combination of aggregated vintage features from bureau.csv and application_train.

Classification Report:¶

WhatsApp Image 2024-12-12 at 18.20.47.jpeg

vm2 - VIM_PREV_MODEL¶

  • vm2 link - {modellink}

The model was built using a combination of aggregated vintage features from previous_application.csv and application_train.

Classification Report:¶

WhatsApp Image 2024-12-12 at 18.29.59.jpeg

Final Dataset Preparation¶

The top features from each feature group, along with the residuals and predicted probabilities from each baseline model, are left-joined with application_train to create the final dataset.

PySpark code link - https://github.com/kautilyaa/TRUST/blob/main/Feature_Selection/Meta%20Model%20Dataset%20Prep.ipynb

In [ ]:
# appl_train = spark.table('hive_metastore.default.application_train')

# #vintage features
# vin_appl_train = spark.table('hive_metastore.default.vin_finalized_application_train')
# vin_prev = spark.table('hive_metastore.default.ay_vin_prev_pred')
# vin_bur = spark.table('hive_metastore.default.ay_vin_bur_pred')

# #amtf features
# amtf_pos_cash = spark.table('hive_metastore.default.amtf_pos_cash_final_prediction_data2')
# amtf_cc = spark.table('hive_metastore.default.amtf_cc_final_pred_df')

# #deq features
# deq_df = spark.table('hive_metastore.default.bu_final_prediction_data2')


# #Dropping TARGET column from vin_appl_train
# vin_appl_train = vin_appl_train.drop("TARGET")
# vin_appl_train.limit(2).display()

# amtf_cc = amtf_cc.withColumnRenamed("Prediction_Probability_pos_cash", "Prediction_Prob_cc")


# # Perform left outer joins
# result_df = appl_train.join(vin_appl_train, on="SK_ID_CURR", how="left") \
#                       .join(vin_prev, on="SK_ID_CURR", how="left") \
#                       .join(vin_bur, on="SK_ID_CURR", how="left") \
#                       .join(amtf_pos_cash, on="SK_ID_CURR", how="left") \
#                       .join(amtf_cc, on="SK_ID_CURR", how="left") \
#                       .join(deq_df, on="SK_ID_CURR", how="left")

# # Display the final result
# result_df.limit(2).display()

Final Model Curation (Meta Model)¶

In [ ]:
final_data_df = pd.read_csv('/content/drive/MyDrive/UMD/DATA602_DATA_Science/Project/Data Segregator /Code/FinalData/ultimate_op_dataset.csv')

8.1. Handle Missing Values¶

Drop columns with more than 50% missing values¶

In [ ]:
missing_threshold = 0.5
missing_data = final_data_df.isnull().mean()
columns_to_drop = missing_data[missing_data > missing_threshold].index
final_data_df = final_data_df.drop(columns=columns_to_drop)
In [ ]:
len(columns_to_drop)
Out[ ]:
55
In [ ]:
final_data_df.shape
Out[ ]:
(307511, 123)

Impute missing values with sentinel values for numeric columns and "UNK" for categorical columns¶


Replaces NaN values:
- For numerical columns, based on the column's minimum value:
  - If the minimum is 0, replace NaN with -1.
  - If the minimum is negative, replace NaN with the nearest negative power of ten scaled out one further order of magnitude, plus 1 (e.g. a minimum of -500 gives -1000 * 10 + 1 = -9999).
  - If the minimum is positive (unexpected), fall back to -10.
- For categorical columns: replace NaN with "UNK".

Parameters:
data (pd.DataFrame): Input DataFrame.

Returns:
pd.DataFrame: DataFrame with NaN values replaced.
In [ ]:
def handle_nan(data):
    for column in data.columns:
        if data[column].dtype in ['float64', 'int64']:
            # Numerical column: choose a sentinel value below the observed minimum
            min_val = data[column].min(skipna=True)

            if min_val == 0:
                replacement_value = -1
            elif min_val < 0:
                # Nearest negative power of ten covering min_val, scaled out one
                # further order of magnitude so the sentinel is clearly out of range
                nearest_negative_power = -10 ** np.ceil(np.log10(abs(min_val)))
                replacement_value = nearest_negative_power * 10 + 1
            else:
                # If min_val > 0 (unexpected), fall back to a default sentinel
                replacement_value = -10

            # Replace NaN with the calculated sentinel value
            data[column] = data[column].fillna(replacement_value)
        elif data[column].dtype == 'object':
            # Categorical column: replace NaN with "UNK"
            data[column] = data[column].fillna("UNK")
    return data

# Apply the function to your dataset
final_data_df = handle_nan(final_data_df)

# Validate the changes
print("Missing Values After Handling:")
print(final_data_df.isnull().sum())
Missing Values After Handling:
SK_ID_CURR                            0
TARGET                                0
NAME_CONTRACT_TYPE                    0
CODE_GENDER                           0
FLAG_OWN_CAR                          0
                                     ..
b_credit_AMT_CREDIT_DEBT_DIFF_mean    0
b_credit_DAYS_CREDIT_mean             0
b_active_DAYS_CREDIT_mean             0
DEQ_AVG_COUNT_DPD0P_36MOB_ALL         0
DEQ_AVG_COUNT_DPD0P_3MOB_ALL          0
Length: 123, dtype: int64
In [ ]:
final_data_df.head()
Out[ ]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... b_DAYS_CREDIT_mean b_720_DAYS_CREDIT_PLAN_sum b_consumer_DAYS_CREDIT_ENDDATE_mean b_credit_AMT_CREDIT_SUM_DEBT_sum b_CNT_CREDIT_PROLONG_sum b_credit_AMT_CREDIT_DEBT_DIFF_mean b_credit_DAYS_CREDIT_mean b_active_DAYS_CREDIT_mean DEQ_AVG_COUNT_DPD0P_36MOB_ALL DEQ_AVG_COUNT_DPD0P_3MOB_ALL
0 100011 0 Cash loans F N Y 0 112500.0 1019610.0 33826.5 ... -1773.000000 -999999.0 -1437.666667 0.0 0.0 -54000.0 -1960.00 -99999.00 0.0 0.0
1 100023 0 Cash loans F N Y 1 90000.0 544491.0 17563.5 ... -1164.384615 3652.0 -264.333333 41881.5 0.0 -47236.5 -1275.25 -1141.25 0.0 0.0
2 100036 0 Cash loans F N Y 0 112500.0 512064.0 25033.5 ... -889.000000 -999999.0 -462.000000 -99999999.0 0.0 -999999999.0 -99999.00 -889.00 0.0 0.0
3 100043 0 Cash loans F N Y 2 198000.0 641173.5 23157.0 ... -1904.000000 -999999.0 -2453.000000 0.0 0.0 -306000.0 -1169.00 -99999.00 0.0 0.0
4 100047 1 Cash loans M N Y 0 202500.0 1193580.0 35028.0 ... -1104.800000 2557.0 701.666667 692253.0 0.0 62253.0 -956.00 -857.00 0.0 0.0

5 rows × 123 columns

8.2. Handle Outliers¶

Cap outliers at the 5th and 95th percentiles for numeric columns¶

In [ ]:
# 2. Handle Outliers
# Cap outliers at the 5th and 95th percentiles for numeric columns
numeric_cols = final_data_df.select_dtypes(include=['float64', 'int64']).columns
for col in numeric_cols:
    lower_bound = final_data_df[col].quantile(0.05)
    upper_bound = final_data_df[col].quantile(0.95)
    final_data_df[col] = np.clip(final_data_df[col], lower_bound, upper_bound)
final_data_df.describe(include='all')
Out[ ]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... b_DAYS_CREDIT_mean b_720_DAYS_CREDIT_PLAN_sum b_consumer_DAYS_CREDIT_ENDDATE_mean b_credit_AMT_CREDIT_SUM_DEBT_sum b_CNT_CREDIT_PROLONG_sum b_credit_AMT_CREDIT_DEBT_DIFF_mean b_credit_DAYS_CREDIT_mean b_active_DAYS_CREDIT_mean DEQ_AVG_COUNT_DPD0P_36MOB_ALL DEQ_AVG_COUNT_DPD0P_3MOB_ALL
count 307511.000000 307511.000000 307511 307511 307511 307511 307511.000000 307511.000000 3.075110e+05 307511.000000 ... 307511.000000 307511.000000 307511.000000 3.075110e+05 307511.000000 3.075110e+05 307511.000000 307511.000000 307511.000000 307511.000000
unique NaN NaN 2 3 2 2 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
top NaN NaN Cash loans F N Y NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
freq NaN NaN 278232 202448 202924 213312 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
mean 278183.072462 0.080729 NaN NaN NaN NaN 0.400509 163167.915380 5.846019e+05 26598.793372 ... -15247.987454 -336709.799864 -200052.930561 -4.755879e+07 -0.143149 -4.762598e+08 -44591.207384 -29846.247700 -0.100255 -0.133198
std 101331.601166 0.272419 NaN NaN NaN NaN 0.664724 72517.568752 3.563103e+05 12481.993515 ... 34644.575222 476031.173670 399518.102108 5.000294e+07 0.350226 4.993877e+08 49169.245572 45255.788575 0.393721 0.356365
min 117945.500000 0.000000 NaN NaN NaN NaN 0.000000 67500.000000 1.350000e+05 9000.000000 ... -99999.000000 -999999.000000 -999999.000000 -1.000000e+08 -1.000000 -1.000000e+09 -99999.000000 -99999.000000 -1.000000 -1.000000
25% 189145.500000 0.000000 NaN NaN NaN NaN 0.000000 112500.000000 2.700000e+05 16524.000000 ... -1719.477273 -999999.000000 -1687.000000 -1.000000e+08 0.000000 -1.000000e+09 -99999.000000 -99999.000000 0.000000 0.000000
50% 278202.000000 0.000000 NaN NaN NaN NaN 0.000000 147150.000000 5.135310e+05 24903.000000 ... -1171.250000 730.000000 -720.333333 0.000000e+00 0.000000 -3.331350e+05 -1791.750000 -801.666667 0.000000 0.000000
75% 367142.500000 0.000000 NaN NaN NaN NaN 1.000000 202500.000000 8.086500e+05 34596.000000 ... -734.633333 2424.000000 -136.571429 6.300000e+04 0.000000 -4.912992e+04 -757.000000 -390.000000 0.000000 0.000000
max 438427.500000 1.000000 NaN NaN NaN NaN 2.000000 337500.000000 1.350000e+06 53325.000000 ... -262.000000 29273.000000 674.000000 4.267385e+05 0.000000 0.000000e+00 -227.000000 -120.000000 0.583333 0.166667

11 rows × 123 columns

8.3. Convert Data Types¶

Convert binary categorical variables (e.g., FLAG_OWN_CAR) to 0/1¶

In [ ]:
binary_map = {'Y': 1, 'N': 0}
binary_cols = ['FLAG_OWN_CAR', 'FLAG_OWN_REALTY']
for col in binary_cols:
    if col in final_data_df.columns:
        final_data_df[col] = final_data_df[col].map(binary_map)

# Convert other categorical variables into one-hot encoding
categorical_cols = final_data_df.select_dtypes(include=['object']).columns
final_data_df = pd.get_dummies(final_data_df, columns=categorical_cols, drop_first=True)

8.4. Scale Numeric Variables¶

Use StandardScaler for features like income, credit, etc.¶

In [ ]:
# Get the residual and probability columns so they can be excluded from scaling
prob_data_col_name = [col for col in final_data_df.columns if 'prob' in col or 'Probability' in col]
res_data_col_name = [col for col in final_data_df.columns if 'residual' in col]
In [ ]:
# Get the numerical feature columns
numeric_cols = final_data_df.select_dtypes(include=['float64', 'int64']).columns
numeric_cols = numeric_cols.difference(['TARGET', 'SK_ID_CURR']) # Exclude these columns
numeric_cols = numeric_cols.difference(prob_data_col_name)
numeric_cols = numeric_cols.difference(res_data_col_name)
numeric_cols
Out[ ]:
Index(['AMT_ANNUITY', 'AMT_CREDIT', 'AMT_GOODS_PRICE', 'AMT_INCOME_TOTAL',
       'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_HOUR',
       'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
       'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_YEAR',
       ...
       'vin_avg_time_between_loans_min',
       'vin_days_since_credit_update_min_max',
       'vin_days_since_first_loan_taken_max_sum',
       'vin_days_since_last_document_update_max',
       'vin_last_loan_cash_loans_loan_sum',
       'vin_last_loan_consumer_loans_loan_sum',
       'vin_months_since_last_approved_max_sum', 'vin_num_active_loans_sum',
       'vin_total_loan_amount_consumer_loans_min_sum',
       'vin_total_loan_amount_xna_std_sum'],
      dtype='object', length=102)
In [ ]:
scaler = StandardScaler()

scaled_numeric_data = pd.DataFrame(
    scaler.fit_transform(final_data_df[numeric_cols]),
    columns=numeric_cols,
    index=final_data_df.index
)

# Drop the original non-scaled numeric columns and concatenate the scaled ones
remainder_data = final_data_df.drop(columns=numeric_cols)
final_data_df = pd.concat([scaled_numeric_data, remainder_data], axis=1)
final_data_df.describe(include='all')
Out[ ]:
AMT_ANNUITY AMT_CREDIT AMT_GOODS_PRICE AMT_INCOME_TOTAL AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_YEAR ... ORGANIZATION_TYPE_Trade: type 6 ORGANIZATION_TYPE_Trade: type 7 ORGANIZATION_TYPE_Transport: type 1 ORGANIZATION_TYPE_Transport: type 2 ORGANIZATION_TYPE_Transport: type 3 ORGANIZATION_TYPE_Transport: type 4 ORGANIZATION_TYPE_University ORGANIZATION_TYPE_XNA EMERGENCYSTATE_MODE_UNK EMERGENCYSTATE_MODE_Yes
count 3.075110e+05 3.075110e+05 3.075110e+05 3.075110e+05 3.075110e+05 3.075110e+05 3.075110e+05 3.075110e+05 3.075110e+05 3.075110e+05 ... 307511 307511 307511 307511 307511 307511 307511 307511 307511 307511
unique NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 2 2 2 2 2 2 2 2 2 2
top NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... False False False False False False False False False False
freq NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 306880 299680 307310 305307 306324 302113 306184 252137 161756 305183
mean -5.402242e-17 4.477992e-17 3.068511e-17 -2.195094e-16 -1.878076e-16 -1.878076e-16 -3.660031e-17 -6.210961e-17 -1.878076e-16 3.844881e-17 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
std 1.000002e+00 1.000002e+00 1.000002e+00 1.000002e+00 1.000002e+00 1.000002e+00 1.000002e+00 1.000002e+00 1.000002e+00 1.000002e+00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
min -1.409937e+00 -1.261829e+00 -1.188593e+00 -1.319240e+00 -2.531109e+00 -2.531109e+00 -1.912958e+00 -1.605835e+00 -2.531109e+00 -1.360678e+00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25% -8.071475e-01 -8.829451e-01 -8.754227e-01 -6.986996e-01 3.950837e-01 3.950837e-01 -1.383378e-02 -1.241621e-01 3.950837e-01 -7.988704e-01 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
50% -1.358594e-01 -1.994638e-01 -2.354655e-01 -2.208836e-01 3.950837e-01 3.950837e-01 -1.383378e-02 -1.241621e-01 3.950837e-01 -2.370632e-01 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
75% 6.407005e-01 6.288016e-01 4.589562e-01 5.423810e-01 3.950837e-01 3.950837e-01 -1.383378e-02 -1.241621e-01 3.950837e-01 8.865512e-01 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
max 2.141184e+00 2.148126e+00 2.351596e+00 2.404002e+00 3.950837e-01 3.950837e-01 1.885290e+00 2.839183e+00 3.950837e-01 2.010165e+00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

11 rows × 226 columns

8.5. Remove Redundant Features¶

Check correlations and drop highly correlated columns¶

In [ ]:
# Compute the correlation matrix
correlation_matrix = final_data_df.corr()

# Identify highly correlated variables (correlation > 0.9)
high_corr_vars = correlation_matrix[(correlation_matrix > 0.9) & (correlation_matrix != 1.0)].stack().index

# Extract the second column (potentially redundant features)
redundant_features = set([col[1] for col in high_corr_vars])

# Define columns to protect (like "residual" and "prob")
# protected_features = [col for col in final_data_df.columns if 'residual' in col or 'prob' in col]
protected_features = [col for col in final_data_df.columns if 'residual' in col]
# Remove redundant features that are NOT protected, plus any probability columns
features_to_drop = [col for col in redundant_features if col not in protected_features or 'prob' in col]

# Drop the redundant features from the dataset
final_data_df_select = final_data_df.drop(columns=features_to_drop, errors='ignore')

# Validate the shape of the new dataset
print(f"Original Dataset Shape: {final_data_df.shape}")
print(f"Reduced Dataset Shape: {final_data_df_select.shape}")

# Plot the correlation matrix
plt.figure(figsize=(12, 10))
plt.title("Correlation Matrix of Selected Features", fontsize=16)
plt.imshow(correlation_matrix, cmap='mako', interpolation='nearest')
plt.colorbar()
plt.show()
Original Dataset Shape: (307511, 226)
Reduced Dataset Shape: (307511, 190)
No description has been provided for this image

8.6. Model Training¶

In [ ]:
# Splitting the data
X = final_data_df_select.drop(columns=['TARGET', 'SK_ID_CURR'])  # Features
y = final_data_df_select['TARGET']  # Target
y = y.astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # Standard 80/20 split; other split ratios gave similar results

# Logistic Regression
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
y_proba = logreg.predict_proba(X_test)[:, 1]

# Evaluation
print(classification_report(y_test, y_pred))
print("AUC-ROC:", roc_auc_score(y_test, y_proba))
              precision    recall  f1-score   support

           0       0.98      1.00      0.99     56539
           1       1.00      0.77      0.87      4964

    accuracy                           0.98     61503
   macro avg       0.99      0.88      0.93     61503
weighted avg       0.98      0.98      0.98     61503

AUC-ROC: 0.9631073900640831

Model Performance Analysis¶

Overall Performance¶

  • Accuracy: The model achieves an impressive accuracy of 0.98, meaning it correctly classifies 98% of the samples. However, with an imbalanced dataset (as mentioned earlier), accuracy can be misleading.
  • AUC-ROC: The Area Under the Receiver Operating Characteristic Curve (AUC-ROC) is 0.96, indicating excellent discriminatory power. This means the model is very good at distinguishing between the two classes (0 and 1).

Class-wise Performance¶

Class 0 (Loan Repaid)¶
  • Precision: 0.98 - For samples predicted as class 0, the model is correct 98% of the time.
  • Recall: 1.00 - The model correctly identifies almost all actual class 0 samples.
  • F1-Score: 0.99 - A high F1-score reflects a good balance between precision and recall for class 0.
Class 1 (Loan Defaulted)¶
  • Precision: 1.00 - For samples predicted as class 1, the model is correct 100% of the time. This indicates a very low rate of false positives for this class.
  • Recall: 0.77 - The model correctly identifies 77% of the actual class 1 samples. This might be an area for improvement.
  • F1-Score: 0.87 - While lower than the F1-score for class 0, it is still relatively high, suggesting a decent balance between precision and recall for class 1.

Support¶

  • Class 0 (Loan Repaid): 56,539 samples.
  • Class 1 (Loan Defaulted): 4,964 samples.
  • The dataset imbalance is evident, with significantly more samples in class 0 than in class 1.

Macro and Weighted Averages¶

  • Macro Average:

    • Calculates the average of precision, recall, and F1-score across both classes, treating them equally.
  • Weighted Average:

    • Calculates the weighted average of precision, recall, and F1-score, considering the number of samples in each class. It gives more weight to the majority class (class 0 in this case).
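To make the averaging concrete, the macro and weighted recall can be recomputed directly from the support counts and per-class recalls in the classification report above:

```python
# Per-class recall and support from the classification report above
support = {0: 56539, 1: 4964}
recall = {0: 1.00, 1: 0.77}

# Macro average: unweighted mean across classes
macro = sum(recall.values()) / len(recall)

# Weighted average: mean weighted by each class's support
total = sum(support.values())
weighted = sum(recall[c] * support[c] for c in recall) / total

print(f"macro recall:    {macro:.3f}")
print(f"weighted recall: {weighted:.3f}")
```

Because class 0 dominates the support, the weighted recall sits close to class 0's recall of 1.00, while the macro recall is pulled down by the weaker class 1 recall.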

Key Insights and Potential Improvements¶

  1. High Performance but Imbalanced:

    • The model demonstrates strong overall performance, particularly in identifying loan repayments (class 0).
    • However, the recall for loan defaults (class 1) is relatively lower, suggesting some defaults might be missed.
  2. Focus on Recall for Class 1:

    • Given the importance of identifying potential loan defaults, consider strategies to improve the recall for class 1.
    • This could involve:
      • Adjusting the model's classification threshold.
      • Using different algorithms.
      • Employing techniques to handle class imbalance, such as oversampling, undersampling, or cost-sensitive learning.
  3. Trade-off Between Precision and Recall:

    • Explore the trade-off between precision and recall, especially for class 1.
    • A lower precision might be acceptable if it leads to a significant increase in recall, as identifying more potential defaults is crucial.
  4. Further Analysis:

    • Analyze misclassified samples to understand the characteristics of cases where the model makes errors.
    • This can provide insights into potential biases or areas for improvement.
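As a sketch of the first suggestion above (adjusting the classification threshold), lowering the decision cutoff trades precision for recall on the minority class. The probabilities below are synthetic stand-ins for `y_proba`, constructed so that defaults tend to receive higher scores:

```python
import numpy as np
from sklearn.metrics import recall_score, precision_score

# Synthetic labels and scores standing in for y_test / y_proba:
# ~8% positives (similar to the real default rate), with positives scored higher
rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.08).astype(int)
y_proba = np.clip(y_true * 0.4 + rng.random(1000) * 0.5, 0, 1)

# Lowering the threshold from 0.5 catches more defaults at the cost of precision
for thr in (0.5, 0.3):
    y_hat = (y_proba >= thr).astype(int)
    print(f"thr={thr}: recall={recall_score(y_true, y_hat):.2f}, "
          f"precision={precision_score(y_true, y_hat, zero_division=0):.2f}")
```

In a credit risk setting, where a missed default is usually costlier than a false alarm, the threshold would be tuned on a validation set against an explicit cost ratio rather than left at the default 0.5.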

Conclusion¶

  • The model shows promising results with high accuracy and AUC-ROC.
  • However, given the imbalanced dataset and the importance of identifying loan defaults, focusing on improving the recall for class 1 is recommended.
  • This can be achieved through the techniques mentioned above.
  • Further analysis of misclassified samples can provide valuable insights for model refinement.

Visualizations¶

In [ ]:
pred_proba = logreg.predict_proba(X_test)[:, 1]
pred_proba
Out[ ]:
array([7.24117649e-03, 4.52342533e-02, 8.10559343e-03, ...,
       4.99365034e-02, 8.86918137e-13, 2.61316840e-02])
In [ ]:
# Create the scatter plot
plt.figure(figsize=(10, 6))
scatter = plt.scatter(
    np.arange(len(pred_proba)),  # Use the index as x-axis
    pred_proba,                 # Use predicted probabilities as y-axis
    c=y_test,                   # Color points by the actual values
    cmap='coolwarm',            # Color map (red/blue for binary classification)
    alpha=0.7,                  # Transparency
    # edgecolor='k'               # Add a black border to points for better visibility
)

# Add colorbar for clarity
plt.colorbar(scatter, label='Actual Values (y_test)')

# Plot settings
plt.title('Predicted Probabilities Colored by Actual Values', fontsize=16)
plt.xlabel('Index', fontsize=14)
plt.ylabel('Predicted Probabilities', fontsize=14)
plt.grid(alpha=0.3)
plt.show()
No description has been provided for this image
In [ ]:
# Sort the predicted probabilities and the corresponding y_test values
sorted_indices = np.argsort(pred_proba)
sorted_pred_proba = pred_proba[sorted_indices]
sorted_y_test = y_test.iloc[sorted_indices]
# Create the scatter plot
plt.figure(figsize=(10, 6))
scatter = plt.scatter(
    np.arange(len(sorted_pred_proba)),  # Use the index of sorted predictions as x-axis
    sorted_pred_proba,                 # Use sorted predicted probabilities as y-axis
    c=sorted_y_test,                   # Color points based on the sorted actual values
    cmap='coolwarm',                   # Color map (red/blue for binary classification)
    alpha=0.7,                         # Transparency
    # edgecolor='k'                      # Add a black border to points for better visibility
)

# Add colorbar for clarity
plt.colorbar(scatter, label='Actual Values (y_test)')

# Plot settings
plt.title('Sorted Predicted Probabilities Colored by Actual Values', fontsize=16)
plt.xlabel('Sorted Index', fontsize=14)
plt.ylabel('Predicted Probabilities', fontsize=14)
plt.grid(alpha=0.3)
plt.show()
No description has been provided for this image
In [ ]:
pred_proba = logreg.predict_proba(X_train)[:, 1]
In [ ]:
quantiles = [0.10, 0.25, 0.50, 0.625, 0.85, 0.90]

# Calculate the quantiles using numpy.quantile()
quantile_values = np.quantile(pred_proba, quantiles)

# Print the results
for quantile, value in zip(quantiles, quantile_values):
    print(f"{quantile*100:g}th percentile: {value:.4f}")
10th percentile: 0.0020
25th percentile: 0.0095
50th percentile: 0.0259
62.5th percentile: 0.0368
85th percentile: 0.0842
90th percentile: 0.1342
In [ ]:
y_pred_proba = logreg.predict_proba(X_test)[:, 1]
In [ ]:
# Use the training-set quantile values as risk-band boundaries
adjusted_thresholds = quantile_values  # 6 thresholds => 7 bins, clipped to 6 labels below

# Assign risk levels (R1 to R6) based on the adjusted thresholds
risk_labels = ['R1', 'R2', 'R3', 'R4', 'R5', 'R6']
risk_levels = np.digitize(y_pred_proba, adjusted_thresholds, right=True)
risk_levels = np.clip(risk_levels, 0, len(risk_labels) - 1)  # Ensure valid index range

# Map risk levels to labels
# Flatten risk_levels to a 1D array before using as index
assigned_risks = [risk_labels[level] for level in risk_levels.flatten()]



# Plot the probabilities with their assigned risk levels based on the quantile thresholds
plt.figure(figsize=(10, 6))

# Scatter plot for probabilities and their risk levels
plt.scatter(y_pred_proba, risk_levels + 1, color='blue', label='Predicted Probabilities', s=100)

# Label each point with its risk level
# for prob, level in zip(y_pred_proba, assigned_risks):
#     plt.text(prob, risk_labels.index(level) + 1.1, f"{level}", fontsize=10, ha='center', color='red')

# Add threshold lines
for threshold, label in zip(adjusted_thresholds, risk_labels):
    plt.axvline(threshold, color='gray', linestyle='--', alpha=0.7)
    plt.text(threshold, 6.5, label, rotation=90, verticalalignment='center', color='green')

# Configure plot
plt.title("Predicted Probabilities and Risk Levels (Quantile Thresholds)", fontsize=14)
plt.xlabel("Predicted Probability", fontsize=12)
plt.ylabel("Risk Level (R1 to R6)", fontsize=12)
plt.yticks(range(1, 7), risk_labels)
plt.grid(True, linestyle='--', alpha=0.6)
plt.legend(["Predicted Probabilities", "Risk Thresholds"])  # label order follows artist order
plt.show()
No description has been provided for this image
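For reference, `np.digitize` with `right=True` returns, for each probability, the number of thresholds strictly below it, and `np.clip` folds the out-of-range top index back into the six labels. A toy check using the quantile thresholds printed above (one illustrative probability per band):

```python
import numpy as np

# The six training-set quantile thresholds computed above
thresholds = np.array([0.0020, 0.0095, 0.0259, 0.0368, 0.0842, 0.1342])
labels = ['R1', 'R2', 'R3', 'R4', 'R5', 'R6']

# One illustrative probability per band
probs = np.array([0.001, 0.005, 0.02, 0.03, 0.05, 0.90])
idx = np.clip(np.digitize(probs, thresholds, right=True), 0, len(labels) - 1)
print([labels[i] for i in idx])  # ['R1', 'R2', 'R3', 'R4', 'R5', 'R6']
```

Note that 0.90 digitizes to index 6 (beyond all thresholds) and is clipped to 5, landing in R6.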
In [ ]:
# Calculate the count of each risk level
risk_level_counts = pd.Series(assigned_risks).value_counts()

# Bar plot for the counts of each risk level
plt.figure(figsize=(8, 5))
risk_level_counts.sort_index().plot(kind="bar", color="orange", edgecolor="black")

# Customize the plot
plt.title("Counts of Each Risk Level", fontsize=14)
plt.xlabel("Risk Level", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.xticks(rotation=0)
plt.grid(axis="y", linestyle="--", alpha=0.7)
plt.show()
No description has been provided for this image
In [ ]:
# Pick one random data point from the test set
random_index = np.random.randint(0, len(X_test))
random_index
Out[ ]:
8637
In [ ]:
random_data_point = X_test.iloc[random_index]
random_data_point
Out[ ]:
224919
AMT_ANNUITY -1.409937
AMT_INCOME_TOTAL -1.31924
AMT_REQ_CREDIT_BUREAU_MON -0.013834
AMT_REQ_CREDIT_BUREAU_QRT -0.124162
AMT_REQ_CREDIT_BUREAU_YEAR -0.237063
... ...
ORGANIZATION_TYPE_Transport: type 3 False
ORGANIZATION_TYPE_Transport: type 4 False
ORGANIZATION_TYPE_University False
EMERGENCYSTATE_MODE_UNK False
EMERGENCYSTATE_MODE_Yes False

188 rows × 1 columns


In [ ]:
residual_data_col_name = [col for col in final_data_df.columns if 'residual' in col]
column_values = random_data_point[residual_data_col_name]
column_values
Out[ ]:
224919
residual_prev 0.498647
residual_bur 0.461597
residual_posh_cash 0.437515
residual 0.486345

In [ ]:
# Pass the point to the meta-model
pred_val = logreg.predict_proba(random_data_point.values.reshape(1, -1))
print(pred_val)

# Map the default probability to a risk band (0 to 5) using the quantile thresholds
default_proba = pred_val[0, 1]
predicted_risk_level_index = min(np.digitize(default_proba, adjusted_thresholds, right=True),
                                 len(risk_labels) - 1)

# Assign the risk band label
print(risk_labels[predicted_risk_level_index])
[[0.23959642 0.76040358]]
R6

Passing a Real-World Data Point¶

In [ ]:
data_point_main_df = final_data_df.loc[X_test.index[random_index]]
data_point_main_df
Out[ ]:
224919
AMT_ANNUITY -1.409937
AMT_CREDIT -1.135534
AMT_GOODS_PRICE -1.052432
AMT_INCOME_TOTAL -1.31924
AMT_REQ_CREDIT_BUREAU_DAY 0.395084
... ...
ORGANIZATION_TYPE_Transport: type 4 False
ORGANIZATION_TYPE_University False
ORGANIZATION_TYPE_XNA False
EMERGENCYSTATE_MODE_UNK False
EMERGENCYSTATE_MODE_Yes False

226 rows × 1 columns


In [ ]:
# get all proba features
prob_data_col_name = [col for col in final_data_df.columns if 'prob' in col or 'Probability' in col]
data_point_main_df_val = data_point_main_df[prob_data_col_name]
data_point_main_df_val
Out[ ]:
224919
pred_proba_prev 0.501353
pred_proba_bur 0.538403
Prediction_Probability_pos_cash 0.562485
Prediction_Probability 0.510933

In [ ]:
# Get the residual feature names
residual_data_col_name = [col for col in final_data_df.columns if 'residual' in col]
# Create a mapping from each probability column to its corresponding residual column
prob_residual_dict = {}
for i in range(len(prob_data_col_name)):
    prob_residual_dict[prob_data_col_name[i]] = residual_data_col_name[i]
prob_residual_dict
Out[ ]:
{'pred_proba_prev': 'residual_prev',
 'pred_proba_bur': 'residual_bur',
 'Prediction_Probability_pos_cash': 'residual_posh_cash',
 'Prediction_Probability': 'residual'}
In [ ]:
def func_normalize(val):
  # Fold probabilities below 0.5 onto [0.5, 1], then rescale to [0, 1];
  # equivalent to |2*val - 1|, a symmetric confidence score.
  if val < 0.5:
    val = 1 - val
  return 2*val - 1
In [ ]:
data_point_main_df_val = data_point_main_df[prob_data_col_name]
# based on dictionary replace the values of random_data_point residual with data_point_main_df[prob_data_col_name]
for key, value in prob_residual_dict.items():
    random_data_point[value] = func_normalize(data_point_main_df_val[key])
In [ ]:
# Pass the updated point to the meta-model
pred_val = logreg.predict_proba(random_data_point.values.reshape(1, -1))
print(pred_val)

# Map the default probability to a risk band using the quantile thresholds
default_proba = pred_val[0, 1]
predicted_risk_level_index = min(np.digitize(default_proba, adjusted_thresholds, right=True),
                                 len(risk_labels) - 1)

# Look up the risk band label
print(risk_labels[predicted_risk_level_index])
[[0.45976428 0.54023572]]
R6
Conclusion¶

This project aimed to predict loan defaults using the Home Credit Default Risk dataset, addressing challenges such as high class imbalance and data complexity. Through a structured approach of feature engineering, model stacking, and targeted analysis, we achieved robust results with meaningful insights.


Key Steps and Insights¶

  1. Feature Engineering:

    • We created three logical feature groups:
      • Amount Financed: Derived from POS cash, credit card balances, and installment payments.
      • Delinquency: Captured loan repayment behavior using bureau and bureau balance tables.
      • Vintage: Highlighted historical trends and loan consistency from previous applications.
    • These groups enabled us to capture applicant-specific financial behavior effectively.

    Example: Features like SUM DELINQUENT MONTHS and RATIO CREDIT USED provided insights into past loan performance and credit utilization.

  2. Baseline Models and Stacking:

    • Separate baseline models were trained on each feature group to avoid noise and collinearity.
    • Outputs from the baseline models were stacked together with demographic features from the main application table.
    • A logistic regression meta-model was trained on the stacked outputs, improving prediction accuracy.

    Visualization:

    • The approach diagram clearly shows how raw data was processed, features aggregated, and models stacked for final predictions.

    approach.jpg Approach Diagram

  3. Model Performance:

    • Our stacked approach handled the class imbalance effectively, yielding strong performance on both the default and non-default classes.

    Performance Metrics:

    • Precision: 0.98 for class 0, 1.00 for class 1.
    • Recall: 1.00 for class 0, 0.77 for class 1.
    • AUC-ROC Score: 0.9631, demonstrating strong predictive capability.

    Visuals:

    • The confusion matrix and classification report highlight our model's ability to balance performance across both classes.
    • Predicted probabilities show a clear separation between default (red) and non-default (blue) cases. confu_matrix_final_model.jpg

Confusion Matrix¶

classi_report_final.jpg

Classification Report¶

pred_value_vs_true_val.jpg

Predicted vs Actual Values¶

pred_value_vs_true_val_sorted.jpg

Sorted Predicted Probabilities¶

  4. Insights into Risk Bands:
    • Final outputs categorized users into risk bands (R1 to R6) for easier interpretation.
    • Distribution of Risk Levels: Bands like R3 and R5 had higher loan defaults, providing actionable insights for risk management. risk_bands_distribution.jpg

Risk_bands_distribution¶
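As a closing illustration of the feature-engineering step, aggregates like SUM DELINQUENT MONTHS are typically built with a groupby over the monthly bureau-balance rows. A simplified sketch (toy rows, keyed directly on SK_ID_CURR rather than the real SK_ID_BUREAU join; STATUS codes follow the Home Credit data dictionary, where '1'–'5' mean past due and 'C' means closed):

```python
import pandas as pd

# Toy bureau_balance-style table: one row per applicant-month
bureau_balance = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 2, 2],
    "STATUS":     ["0", "1", "2", "0", "C"],
})

# SUM DELINQUENT MONTHS: count of past-due months per applicant
sum_delinquent_months = (
    bureau_balance["STATUS"].isin(list("12345"))
    .groupby(bureau_balance["SK_ID_CURR"])
    .sum()
)
print(sum_delinquent_months.to_dict())  # {1: 2, 2: 0}
```

Ratios such as RATIO CREDIT USED follow the same pattern, dividing two aggregated columns per applicant before merging back onto the main application table.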


Final Thoughts¶

This project highlights how thoughtful feature engineering, model stacking, and domain knowledge can uncover insights even in highly imbalanced datasets. By breaking down data into meaningful groups, training baseline models, and combining them into a meta-model, we achieved high accuracy and reliable predictions.

These results can help financial institutions make better lending decisions, reduce risk, and ensure responsible credit assessment.

For a detailed look at the code, methodology, and full analysis, please refer to our GitHub Repository.